Speaker recognition example: requirements on audio files? #8006

trzy · 2023-12-10T01:03:20Z

I'd like to test the speaker recognition process using a few enrollment audio files and a variety of test files. I have a few .wav recordings of myself speaking, plus a test set of .ogg and .wav files of other speakers with some of my own recordings mixed in. However, the script always fails with tensor shape errors. Does it make some assumption about the input format? These recordings have different sample rates and possibly different channel counts, but I assumed they'd be normalized.

Example error:

[NeMo I 2023-12-09 16:59:15 features:289] PADDING: 16
[NeMo I 2023-12-09 16:59:17 save_restore_connector:249] Model EncDecSpeakerLabelModel was successfully restored from C:\Users\Bart\.cache\torch\NeMo\NeMo_1.21.0rc0\titanet-l\11ba0924fdf87c049e339adbf6899d48\titanet-l.nemo.
[NeMo I 2023-12-09 16:59:17 collections:445] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-12-09 16:59:17 collections:446] Dataset loaded with 6 items, total duration of  0.01 hours.
[NeMo I 2023-12-09 16:59:17 collections:448] # 6 files loaded accounting to # 1 labels
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.52s/it]
[NeMo I 2023-12-09 16:59:19 collections:445] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-12-09 16:59:19 collections:446] Dataset loaded with 3 items, total duration of  0.00 hours.
[NeMo I 2023-12-09 16:59:19 collections:448] # 3 files loaded accounting to # 1 labels
  0%|                                                                                                                                                       | 0/1 [00:00<?, ?it/s]
Error executing job with overrides: ['data.enrollment_manifest=bart.json', 'data.test_manifest=test.json', 'backend.backend_model=cosine_similarity']
Traceback (most recent call last):
  File "C:\projects\nemo\examples\speaker_tasks\recognition\speaker_identification_infer.py", line 61, in main
    test_embs, _, _, _ = speaker_model.batch_inference(test_manifest, batch_size, sample_rate, device=device,)
  File "C:\Users\Bart\anaconda3\envs\nemo\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Bart\anaconda3\envs\nemo\lib\site-packages\nemo_toolkit-1.21.0rc0-py3.10.egg\nemo\collections\asr\models\label_models.py", line 638, in batch_inference
    logit, emb = self.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
  File "C:\Users\Bart\anaconda3\envs\nemo\lib\site-packages\wrapt-1.16.0-py3.10.egg\wrapt\wrappers.py", line 669, in __call__
    return self._self_wrapper(self.__wrapped__, self._self_instance,
  File "C:\Users\Bart\anaconda3\envs\nemo\lib\site-packages\nemo_toolkit-1.21.0rc0-py3.10.egg\nemo\core\classes\common.py", line 1084, in __call__
    instance._validate_input_types(input_types=input_types, ignore_collections=self.ignore_collections, **kwargs)
  File "C:\Users\Bart\anaconda3\envs\nemo\lib\site-packages\nemo_toolkit-1.21.0rc0-py3.10.egg\nemo\core\classes\common.py", line 228, in _validate_input_types
    raise TypeError(
TypeError: Input shape mismatch occured for input_signal in module EncDecSpeakerLabelModel :
Input shape expected = (batch, time) |
Input shape found : torch.Size([3, 106496, 2])

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Any help would be appreciated. Thank you!

redoctopus (Collaborator) · 2023-12-11T18:51:44Z

With the disclaimer that I'm not too familiar with the speaker recognition codebase, this does look like an error I used to run into when I recorded my own test data and didn't "flatten" the channels. The NeMo audio file loaders do normalize the sample rate, but I don't think they handle multiple channels properly. You'll have to average your recordings' left/right channels beforehand.
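
For reference, the channel averaging described above could look roughly like the sketch below. It is not part of NeMo: `downmix_wav_to_mono` is a hypothetical helper that uses only the Python standard library and assumes 16-bit PCM .wav input. The .ogg files would first need to be decoded with some other tool.

```python
import array
import sys
import wave


def downmix_wav_to_mono(src_path: str, dst_path: str) -> None:
    """Average all channels of a 16-bit PCM WAV file into one mono channel.

    Hypothetical helper for illustration; assumes 16-bit PCM input.
    """
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM samples")

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved per frame (L0 R0 L1 R1 ...).
    # Average each frame; integer floor division keeps values in int16 range.
    mono = array.array(
        "h",
        (
            sum(samples[i : i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()  # convert back to little-endian before writing

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(frame_rate)  # sample rate is left unchanged
        dst.writeframes(mono.tobytes())
```

Running this over each stereo recording before building the manifests should give the model the `(batch, time)` input it expects, instead of the `[3, 106496, 2]` tensor in the traceback.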

trzy Dec 11, 2023
Author

Yep, turns out the inputs must be mono and I had some stereo inputs in there :)
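
In case it helps anyone else track down stray stereo files, here is a rough sketch of a check over a manifest. It assumes the manifest is a JSON-lines file whose entries carry an `audio_filepath` key, which is how I understand the NeMo manifests to be laid out. `find_non_mono_wavs` is a hypothetical helper, and it only inspects .wav files because the standard-library `wave` module cannot read .ogg.

```python
import json
import wave


def find_non_mono_wavs(manifest_path: str) -> list[tuple[str, int]]:
    """Return (path, channel_count) for every WAV entry that is not mono.

    Hypothetical helper; assumes one JSON object per line with an
    "audio_filepath" key.
    """
    offenders = []
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            path = json.loads(line)["audio_filepath"]
            if not path.lower().endswith(".wav"):
                continue  # non-WAV files (e.g. .ogg) must be checked separately
            with wave.open(path, "rb") as wav:
                channels = wav.getnchannels()
            if channels != 1:
                offenders.append((path, channels))
    return offenders
```

Calling `find_non_mono_wavs("test.json")` would then list the files that need to be downmixed before running the inference script.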
