[Question] Loss explodes instantly when fine-tuning Conformer-CTC on custom dataset #4421

sqrk · 2022-06-19T09:15:31Z

sqrk
Jun 19, 2022

Hello!

I'm new to ASR (AI in general) and I'm struggling to fine-tune the Conformer CTC (small) on the UASpeech dataset.

If I try to fine-tune the model from the NeMo pre-trained checkpoint, the loss starts around 40 and jumps to 4.14e+03 in the next step then some value around e+05 and stays in that range. Val_wer is nearly a constant 100%. It occasionally predicts long sequences of repeated characters (e.g. iiiiiiiiiiiiiiiiiiiiiiiiii). I'm wondering if I'm making some mistake when I change the vocabulary. I use the process_asr_text_tokenizer.py script giving 455 as the vocab size (the number of unique words in the dataset) to create the vocabulary then pass it to model.cfg["train_ds"]["manifest_filepath"]. I also set

model.cfg["decoder"]["vocabulary"] = []
model.cfg["decoder"]["num_classes"] = 455

Here's my model.cfg:

{'sample_rate': 16000, 
'log_prediction': True, 
'ctc_reduction': 'mean_batch', 
'train_ds': 
	{'manifest_filepath': path_to_custom_train_set, 
	'sample_rate': 16000, 
	'batch_size': 16, 
	'shuffle': True, 
	'num_workers': 8, 
	'pin_memory': True, 
	'use_start_end_token': False, 
	'trim_silence': False, 
	'max_duration': 16.7, 
	'min_duration': 0.1, 
	'is_tarred': False, 
	'tarred_audio_filepaths': None, 
	'shuffle_n': 2048, 
	'bucketing_strategy': 'synced_randomized', 
	'bucketing_batch_size': None}, 
'validation_ds': 
	{'manifest_filepath': path_to_custom_dataset_val_set, 
	'sample_rate': 16000, 
	'batch_size': 16, 
	'shuffle': False, 
	'num_workers': 8, 
	'pin_memory': True, 
	'use_start_end_token': False}, 
'test_ds': 
	{'manifest_filepath': ['/manifests/librispeech/librivox-dev-other.json', 
	'/manifests/librispeech/librivox-dev-clean.json', 
	'/manifests/librispeech/librivox-test-other.json', 
	'/manifests/librispeech/librivox-test-clean.json'], 
	'sample_rate': 16000, 
	'batch_size': 32, 
	'shuffle': False, 
	'num_workers': 8, 
	'pin_memory': True, 
	'use_start_end_token': False, 
	'is_tarred': False, 
	'tarred_audio_filepaths': ''}, 
'tokenizer': 
	{'dir': path_to_custom_tokenizer_spe_bpe, 
	'type': 'bpe'}, 
'preprocessor': 
	{'_target_': 'nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor', 
	'sample_rate': 16000, 
	'normalize': 'per_feature', 
	'window_size': 0.025, 
	'window_stride': 0.01, 
	'window': 'hann', 
	'features': 80, 
	'n_fft': 512, 
	'log': True, 
	'frame_splicing': 1, 
	'dither': 1e-05, 
	'pad_to': 0, 
	'pad_value': 0.0}, 
'spec_augment': 
	{'_target_': 'nemo.collections.asr.modules.SpectrogramAugmentation', 
	'freq_masks': 2, 
	'time_masks': 2, 
	'freq_width': 27, 
	'time_width': 100}, 
'encoder': 
	{'_target_': 'nemo.collections.asr.modules.ConformerEncoder', 
	'feat_in': 80, 
	'feat_out': -1, 
	'n_layers': 16, 
	'd_model': 176, 
	'subsampling': 'striding', 
	'subsampling_factor': 4, 
	'subsampling_conv_channels': 176, 
	'ff_expansion_factor': 4, 
	'self_attention_model': 'rel_pos', 
	'n_heads': 4, 
	'att_context_size': [-1, -1], 
	'xscaling': True, 
	'untie_biases': True, 
	'pos_emb_max_len': 5000, 
	'conv_kernel_size': 31, 
	'dropout': 0.1, 
	'dropout_emb': 0.0, 
	'dropout_att': 0.1}, 
'decoder': 
	{'_target_': 'nemo.collections.asr.modules.ConvASRDecoder', 
	'feat_in': 176, 
	'num_classes': 455, 
	'vocabulary': []}, 
'optim': 
	{'name': 'adamw', 
	'lr': 5, 
	'betas': [0.9, 
	0.98], 
	'weight_decay': 0, 
	'sched': 
		{'name': 'NoamAnnealing', 
		'd_model': 176, 
		'warmup_steps': 10000, 
		'warmup_ratio': None, 
		'min_lr': 1e-06}
	}, 
'init_weights_from_model': '/exp_init//ngc_checkpoints/small/ls_d176_adamlr2_wd0_aug5x0.05_wu10k_stridc176_ctc_2k_wpe128_untied_newrel.nemo', 
'target': 'nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE'}

I'm thinking that there might be an issue with the init_weights_from_model having a different vocabulary size because when I try to perform evaluation on the checkpoint I fine-tuned (by calling model = nemo_asr.models.EncDecCTCModelBPE.load_from_checkpoint(path_to_checkpoint)), I get the following error where the mismatch corresponds with the old and new vocabulary sizes.

RuntimeError: Error(s) in loading state_dict for EncDecCTCModelBPE:
	size mismatch for decoder.decoder_layers.0.weight: copying a param with shape torch.Size([129, 176, 1]) from checkpoint, the shape in current model is torch.Size([456, 176, 1]).
	size mismatch for decoder.decoder_layers.0.bias: copying a param with shape torch.Size([129]) from checkpoint, the shape in current model is torch.Size([456]).

The loss explosion doesn't happen if I start training from scratch (using a config file). The loss starts around 400 but decreases rapidly to 11/12 within the first epoch then plateaus. After 60 epochs, the predictions still do not seem to improve (mostly random single letters), but maybe it just needs more training? Val_wer fluctuates between 99.9 and 100%.

Any idea as to what I'm doing wrong here? It's good to note that the dataset is of impaired speech but it doesn't seem to be responsible for the loss issue since it doesn't happen when the model is trained from scratch.
When you use the tokenizer script, is it typical to do it just for the train manifest or a union of both? I'm only doing it for the train set to have a good "in the wild" evaluation.
When you have a small dataset, do you typically still need 2 distinct val and test sets? I use the same one for evaluation since it doesn't affect training, and I've noticed that existing papers do not mention a third split.

Thank you so much in advance!

Answered by sqrk

Jun 24, 2022

After some investigation, I think that the exploding loss could have been caused by the vocabulary size being too large. I initially thought the vocab size was supposed to be the number of unique words in my dataset, but since we're using CTC it should be at least the number of characters in the dataset or more. And in order to use the pre-trained weights, the vocabulary size needs to match.

View full answer

ericharper · 2022-06-23T18:04:59Z

ericharper
Jun 23, 2022
Maintainer

@VahidooX could you take a look at this one?

0 replies

sqrk · 2022-06-24T07:54:03Z

sqrk
Jun 24, 2022
Author

After some investigation, I think that the exploding loss could have been caused by the vocabulary size being too large. I initially thought the vocab size was supposed to be the number of unique words in my dataset, but since we're using CTC it should be at least the number of characters in the dataset or more. And in order to use the pre-trained weights, the vocabulary size needs to match.

0 replies

VahidooX · 2022-06-24T17:49:38Z

VahidooX
Jun 24, 2022
Collaborator

What is init_weights_from_model you have used? I don't think we have this one in nemo, right?
If you want to load the weights from a pretrained model you need to use init_from_nemo_model. Here is the detail how to use it:
https://github.com/NVIDIA/NeMo/blob/main/nemo/core/classes/modelPT.py#L920

You need to add +init_from_nemo_model=model_nemo_file.nemo

If you have a very small dataset, it is suggested to use the same tokenizer as the pretrained model. You need to change the tokenizer mostly when you train it on another language. Anyway, when you change the token size, then the decoder weights cannot get loaded as they have different shapes. You can skip loading the weights by using exclude option like here:

#3944

What is the accuracy of the pretrained model on your dataset without fine-tuning? If you want to fine-tune on small dataset, I suggest to keep the same tokenizer. Nemo files are regular gzip file, and you may unzip the nemo file and get the tokenizer out. Use lower lr like lr=1 or lr=0.5 so that the model don't diverge significantly.

@titu1994 do we have details of how users can load these checkpoint in our docs?

6 replies

sqrk Jun 29, 2022
Author

@VahidooX I use the second way in the link you mentioned, which I think downloads a .nemo file (e.g. models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_conformer_ctc_small")).
Concerning init_weights_from_model, I don't specify it myself, so I'm assuming that it comes from loading the pre-trained model.

Oh, so I don't need to change the vocabulary at all if I'm using the same language. Thanks, I'll try that.

I get 254.26% WER before fine-tuning. I will try your lr suggestions. Thank you so much!

VahidooX Jun 29, 2022
Collaborator

Is 254.26% correct? Have you checked the output of the model (transcriptions) evaluated on this dataset? The pretrained model should give you reasonable results on the English speech.

titu1994 Jun 30, 2022
Maintainer

Can you check if your text is lower case characters only, no numbers no digits no special characters other than space, and apostrophe ? 256% wer is entirely incorrect for a pretrained ASR model.

sqrk Jun 30, 2022
Author

@VahidooX Yes, the transcriptions are incorrect. Here are a couple of examples:

[NeMo I 2022-06-29 21:08:56 wer_bpe:204] reference:but
[NeMo I 2022-06-29 21:08:56 wer_bpe:205] predicted:ft
[NeMo I 2022-06-29 21:08:56 wer_bpe:203] 
    
[NeMo I 2022-06-29 21:08:56 wer_bpe:204] reference:call
[NeMo I 2022-06-29 21:08:56 wer_bpe:205] predicted:if you call and 
[NeMo I 2022-06-29 21:08:56 wer_bpe:203] 
    
[NeMo I 2022-06-29 21:08:56 wer_bpe:204] reference:each
[NeMo I 2022-06-29 21:08:56 wer_bpe:205] predicted:the s s ed
[NeMo I 2022-06-29 21:08:56 wer_bpe:203] 
    
[NeMo I 2022-06-29 21:08:56 wer_bpe:204] reference:shift
[NeMo I 2022-06-29 21:08:56 wer_bpe:205] predicted:you ss
[NeMo I 2022-06-29 21:08:57 wer_bpe:203]

It did not surprise me to get such a high WER because I'm using impaired speech which is hard to understand even for humans. The current SOTA for the task is around 25%.

I did however try to evaluate the model on VCTK and it also gave a very high WER (132%). I'm not sure what I'm doing wrong. I uploaded my evaluation script here.

@titu1994 Yes, I lowercased everything, and removed all punctuation and special characters.

titu1994 Jun 30, 2022
Maintainer

We have prediction and evaluation (with wer measurement scripts from a dataset available here - https://github.com/NVIDIA/NeMo/tree/main/examples/asr)

I see you set the sample rate to 48k on the config, note that this does not change any thing for the model since you are only changing it's config. We recommend processing the audio to be 16 khz monochannel before passing to Nemo for correctness (though internationally it will auto downsample the audio to correct sample rate, but it's slow and CPU based).

Also there's no model.eval() after creating the model.
Otherwise I don't see anything problematic with the script. Potentially the audio is simply difficult so the model is unable to operate properly.

singhajeetzee · 2023-05-09T07:15:14Z

singhajeetzee
May 9, 2023

Hi @VahidooX, @titu1994 ,
As you said if we have limited data and we want to finetune in the same language so we need to use the existing tokenizer files(.vocab and .model).

As new data has some new words so-
1 - How can we tokenize the new data with existing tokenizer (which is used to tokenize the existing model data)
2 - And after tokenizing we will get some new tokens, so how can we append/merge the new tokens with existing .vacb and .model files data?

Please suggest me, as I have limited data and wanted to finetune the model in same language.

1 reply

titu1994 May 9, 2023
Maintainer

New words are not an issue - as long as there aren't special tokens that are not space, comma, apostrophe or lower case English letters. Words themselves can still be Tokenized without issue.

There is no way to merge or update Sentencepiece Tokenizers. So you'll have to build one from scratch if you cannot process your data to suite the tokenizer

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Question] Loss explodes instantly when fine-tuning Conformer-CTC on custom dataset #4421

{{title}}

Replies: 4 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

[Question] Loss explodes instantly when fine-tuning Conformer-CTC on custom dataset #4421

sqrk Jun 19, 2022

Replies: 4 comments · 7 replies

ericharper Jun 23, 2022 Maintainer

sqrk Jun 24, 2022 Author

VahidooX Jun 24, 2022 Collaborator

sqrk Jun 29, 2022 Author

VahidooX Jun 29, 2022 Collaborator

titu1994 Jun 30, 2022 Maintainer

sqrk Jun 30, 2022 Author

titu1994 Jun 30, 2022 Maintainer

singhajeetzee May 9, 2023

titu1994 May 9, 2023 Maintainer

sqrk
Jun 19, 2022

Replies: 4 comments 7 replies

ericharper
Jun 23, 2022
Maintainer

sqrk
Jun 24, 2022
Author

VahidooX
Jun 24, 2022
Collaborator

sqrk Jun 29, 2022
Author

VahidooX Jun 29, 2022
Collaborator

titu1994 Jun 30, 2022
Maintainer

sqrk Jun 30, 2022
Author

titu1994 Jun 30, 2022
Maintainer

singhajeetzee
May 9, 2023

titu1994 May 9, 2023
Maintainer