export and augmentation for speaker verification #8132
-
I have been looking into setting up a speaker verification system with NeMo's TitaNet model. The general idea is a system exposing a REST API with three endpoints:

```json
{
  "openapi": "3.0.0",
  "info": {
    "title": "speaker verification REST API",
    "version": "1.0.0"
  },
  "paths": {
    "/register": {
      "post": {
        "summary": "register a speaker by fine-tuning and exporting a TitaNet model with new recordings",
        "requestBody": {
          "required": true,
          "content": {
            "multipart/form-data": {
              "schema": {
                "type": "object",
                "properties": {
                  "speaker": {
                    "type": "string"
                  },
                  "recordings": {
                    "type": "array",
                    "items": {
                      "type": "string",
                      "format": "binary"
                    }
                  }
                },
                "required": ["speaker", "recordings"]
              }
            }
          }
        },
        "responses": {
          "200": {
            "description": "speaker registered successfully"
          }
        }
      }
    },
    "/verify": {
      "post": {
        "summary": "verify speaker identity with a recording",
        "requestBody": {
          "required": true,
          "content": {
            "multipart/form-data": {
              "schema": {
                "type": "object",
                "properties": {
                  "recording": {
                    "type": "string",
                    "format": "binary"
                  }
                },
                "required": ["recording"]
              }
            }
          }
        },
        "responses": {
          "200": {
            "description": "<speaker>"
          },
          "404": {
            "description": "unknown speaker"
          }
        }
      }
    },
    "/unregister": {
      "post": {
        "summary": "unregister a speaker by fine-tuning and exporting a TitaNet model without their recordings",
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": {
                  "speaker": {
                    "type": "string"
                  }
                },
                "required": ["speaker"]
              }
            }
          }
        },
        "responses": {
          "200": {
            "description": "speaker unregistered successfully"
          },
          "404": {
            "description": "speaker not found"
          }
        }
      }
    }
  }
}
```

Now comes the difficult part, as export and augmentation are key to the overall quality of the system. After going through the speaker verification Jupyter notebook, I still have a few doubts.

export

For starters, model export is standard practice when deploying a machine learning model for inference, yet the Jupyter notebook does not mention it: its inference example uses the fine-tuned PyTorch or NeMo checkpoint directly. The only pointers I have found for TitaNet export to ONNX are #7245 and #6759, but the comments there do not make the recommended path clear.
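Independent of the export format, the `/verify` endpoint above ultimately reduces to comparing speaker embeddings. Below is a minimal numpy-only sketch of that decision rule; the helper names and the 0.7 threshold are placeholders of mine, not NeMo defaults:

```python
import numpy as np


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify(query, enrolled, threshold=0.7):
    """Return the best-matching speaker name, or None (-> HTTP 404)."""
    best_name, best_score = None, -1.0
    for name, emb in enrolled.items():
        score = cosine_similarity(query, emb)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# toy usage with random stand-ins for real embeddings
# (TitaNet-L produces 192-dimensional embeddings)
rng = np.random.default_rng(0)
alice = rng.normal(size=192)
enrolled = {"alice": alice, "bob": rng.normal(size=192)}
noisy_query = alice + 0.05 * rng.normal(size=192)
print(verify(noisy_query, enrolled))  # alice
```

In a real deployment the enrolled embeddings would come from averaging the model's output over each speaker's registration recordings, which also sidesteps re-exporting the model on every `/register` call.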
Is there any official reference for model export for speaker verification in NeMo?

augmentation

The Jupyter notebook suggests dataset augmentation for better performance, which I think would be even more critical in this scenario, since it would spare a person from having to make a large number of recordings for registration. Therefore: 1) what is the optimal number of recordings per speaker for speaker verification fine-tuning? 2) how much dataset augmentation is optimal, and which augmentation techniques would be most appropriate? A few examples are shared in the online augmentation Jupyter notebook, but it is not clear which of them fit this use case.
The Jupyter notebook repeatedly suggests opting for one-step offline augmentation to avoid excessive slowdown during training, but I could not find any official example of offline dataset augmentation in NeMo.
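For what it's worth, offline augmentation can also be rolled by hand before writing a new training manifest. Below is a minimal numpy-only sketch of what I have in mind; the SNR values and speed factors are illustrative guesses, not NeMo recommendations:

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Mix in white noise at the given signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def speed_perturb(signal, factor):
    """Naive speed perturbation via linear-interpolation resampling."""
    n_out = int(round(len(signal) / factor))
    x_old = np.arange(len(signal))
    x_new = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(x_new, x_old, signal)


rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz at 16 kHz
augmented = [add_noise(clean, snr_db, rng) for snr_db in (5, 10, 20)]
augmented += [speed_perturb(clean, f) for f in (0.9, 1.1)]
```

Each augmented array would then be written back to disk (e.g. with `soundfile` or the stdlib `wave` module) and listed in the training manifest alongside the originals.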
-
Any pointers on that, @titu1994?
-
Hi @DiTo97, in my experience (from about a year ago), inference with ONNX is definitely faster than with the native PyTorch checkpoint, but I don't recall exact numbers.
I suggest using online augmentation, as I did for TitaNet training.
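Online augmentation is configured through the `augmentor` section of `train_ds` in the training config. A sketch of such a section is below; the probabilities and ranges are illustrative, not recommended defaults, and `???` marks a value you must fill in:

```yaml
model:
  train_ds:
    augmentor:
      white_noise:
        prob: 0.5
        min_level: -90   # dB
        max_level: -46   # dB
      speed:
        prob: 0.5
        sr: 16000
        resample_type: kaiser_fast
        min_speed_rate: 0.95
        max_speed_rate: 1.05
      noise:
        prob: 0.5
        manifest_path: ???   # manifest of background-noise clips
        min_snr_db: 0
        max_snr_db: 15
```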
export
This is correct; that is the reason we use the current NeMo model: to reuse its preprocessor. However, you can just import the dataloader for that class and replicate the data loading yourself, to avoid having to load it from the model.
I am not currently working on ONNX export, so my knowledge may be outdated; adding @borisfom to answer this query.
Yes, we currently onl…