Add support for pyannote diarization 3.0 embedding model #184

sorgfresser · 2023-10-13T15:14:24Z

TLDR

The ONNX embedding model specified in the new pyannote diarization pipeline cannot be loaded with EmbeddingModel.from_pyannote. I have several questions about this, listed at the end.

What's the issue?

I'm trying to get diart to work with the new pyannote 3.0 models. My main issue is the embedding model, which is an ONNX model in pyannote/speaker-diarization-3.0.

Whenever I try to load hbredin/wespeaker-voxceleb-resnet34-LM as an embedding model to pass into SpeakerDiarizationConfig, I get:

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/hbredin/wespeaker-voxceleb-resnet34-LM/resolve/main/pytorch_model.bin

Which is fair, since the model file is named speaker-embedding.onnx, so pytorch_model.bin does not exist.

pyannote has added WeSpeakerPretrainedSpeakerEmbedding as a pipeline for this case. I therefore tried to initialize this pipeline beforehand and pass it to EmbeddingModel.from_pyannote. However, since it is a pipeline and not a model, pyannote_loader.get_model() does not work: isinstance(model, Model) is false, and PyTorch Lightning can't load ONNX models.

Questions

Would it be even possible to use this as an embedding model?
If so, is it possible with the current diart already?
If it is technically possible but diart can't do it yet, can you point me in the gentle direction what I would have to change? I'd be willing to create a PR for this.

juanmc2005 · 2023-10-16T13:50:53Z

Maintainer

Hi @sorgfresser, thank you for posting this, you make an excellent point.

Would it be even possible to use this as an embedding model?

Currently, this model cannot be used. The immediate reason is that, as you mention, the pyannote API has changed and the code needs to be updated to match.
However, there's also a fundamental compatibility problem: pyannote's x-vector model accepts a weights parameter to compute the weighted mean/std in the statistics pooling block (see here). This allows diart to obtain multiple speaker embeddings from a single audio chunk (see the paper, section 2.2.1).
To make the ResNet model compatible with this, there are two options:

1. Modify the speaker embedding code from pyannote/wespeaker to include the weights parameter for overlap-aware embeddings, as described in the paper
2. Instead of using the weights for statistics pooling, crop the audio of each speaker and extract one embedding per speaker using the cropped audio
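
To make the weighted pooling idea concrete, here is a rough NumPy sketch. It only illustrates overlap-aware statistics pooling and is not pyannote's actual implementation: the function name weighted_stats_pooling, the array shapes, and the plain (biased) weighted variance are assumptions made for the example.

```python
import numpy as np


def weighted_stats_pooling(features, weights, eps=1e-8):
    """Pool frame-level features into one vector per speaker (illustrative only).

    features: (num_frames, dim) frame-level features
    weights:  (num_frames, num_speakers) non-negative per-frame weights
    returns:  (num_speakers, 2 * dim) weighted mean and std, concatenated
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Normalize each speaker's weights to sum to 1 over the frames axis.
    totals = np.maximum(weights.sum(axis=0, keepdims=True), eps)
    norm = weights / totals  # (num_frames, num_speakers)
    # Weighted mean per speaker: (num_speakers, dim).
    mean = norm.T @ features
    # Weighted (biased) variance of each speaker around its own mean.
    diff = features[None, :, :] - mean[:, None, :]  # (speakers, frames, dim)
    var = (norm.T[:, :, None] * diff ** 2).sum(axis=1)
    std = np.sqrt(var)
    return np.concatenate([mean, std], axis=1)


# Hypothetical toy chunk: 4 frames, 2 feature dimensions, 2 speakers.
features = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [10.0, 10.0]])
weights = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
pooled = weighted_stats_pooling(features, weights)  # one row per speaker
```

Each column of weights yields its own pooled vector, which is how a single chunk can produce several speaker embeddings once the pooled statistics are fed to the rest of the network.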

If so, is it possible with the current diart already?

No, this would require an integration effort with some coding. It's certainly doable though, and not very complicated IMO

If it is technically possible but diart can't do it yet, can you point me in the gentle direction what I would have to change? I'd be willing to create a PR for this.

With pleasure! I think I answered this question a bit generally in the first part.
Seeing the code of WeSpeakerPretrainedSpeakerEmbedding, I see there is a masks parameter that could relate to one of the two approaches I described above, probably approach number 2 (see here). However, it looks like there's no explanation of that parameter, maybe @hbredin can clarify? 👀

In any case, I'll try to point you to the exact pieces of code that would have to change:

Modify PyannoteLoader (see here) to load both v2 and v3 pyannote models. Alternatively, another loader could be created. Notice that the v3 API might already be compatible with v2 models. For example I think you should be able to load pyannote/embedding like this:

from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
get_embedding = PretrainedSpeakerEmbedding("pyannote/embedding")

Modify PyannoteEmbeddingModel (see here) to compute the appropriate masks from the segmentation weights (only if needed of course, otherwise keep weights).

And that's it: with these two changes you should be able to run the new model. I would greatly appreciate a PR with this feature! If you run into any trouble during the implementation, let me know and I'll help with whatever I can. You can also create a draft PR from an initial implementation so we can discuss the code more easily right there.

Thank you! Looking forward to seeing that PR :)

2 replies

juanmc2005 Oct 16, 2023
Maintainer

Looking at how masks are used in WeSpeakerPretrainedSpeakerEmbedding (see here), I think the mask is a probability tensor where 1 means "use this frame to extract the embedding" and 0 means the opposite. This is later binarized with a 0.5 threshold.

So to add a bit more information about the changes in PyannoteEmbeddingModel, I would simply re-normalize weights to be between 0 and 1 along the frames axis, nothing else. This normalization should not change the effect on pyannote/embedding
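
A minimal sketch of that re-normalization, assuming weights of shape (num_frames, num_speakers) and reading "between 0 and 1 along the frames axis" as a per-speaker min-max rescaling. The function name weights_to_masks and the strict > 0.5 comparison are assumptions for illustration; the binarization inside WeSpeakerPretrainedSpeakerEmbedding may differ in detail.

```python
import numpy as np


def weights_to_masks(weights, threshold=0.5, eps=1e-8):
    """Rescale per-speaker weights to [0, 1] and binarize them (illustrative only).

    weights: (num_frames, num_speakers) per-frame speaker weights
    returns: (scaled, masks), both of shape (num_frames, num_speakers)
    """
    weights = np.asarray(weights, dtype=float)
    # Min-max rescale each speaker's column along the frames axis.
    low = weights.min(axis=0, keepdims=True)
    high = weights.max(axis=0, keepdims=True)
    # A constant column has high == low; it maps to all zeros here.
    scaled = (weights - low) / np.maximum(high - low, eps)
    # Frames above the threshold are the ones kept for the embedding.
    masks = scaled > threshold
    return scaled, masks


# Hypothetical weights for 3 frames and 2 speakers.
example_weights = np.array([[0.0, 2.0], [1.0, 5.0], [4.0, 6.0]])
scaled, masks = weights_to_masks(example_weights)
```

With this reading, weights that are already in [0, 1] and span the full range are left unchanged, which matches the expectation that the normalization should not affect pyannote/embedding.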

hbredin Oct 17, 2023
Collaborator

This should do the trick, indeed.
