
Inquiry on embedding extractions for voice comparisons #105

Open
PhilipAmadasun opened this issue Jan 7, 2024 · 16 comments

Comments
@PhilipAmadasun

PhilipAmadasun commented Jan 7, 2024

If I want my program to recognize the voice of someone whose embedding I've already stored, is it better that the stored embedding be extracted from a short or a long .wav of the person speaking, so that the model has an easier time identifying the voice correctly (at least a 0.75 to 0.8 probability match)? Or does the length not matter to some extent (for instance, a 2 minute .wav file versus a 5 minute or 10 minute .wav file)? I want to compare the stored embedding with an embedding of the person speaking for, say, 5 seconds, 10 seconds and longer.

I'm using the pre-trained model ResCNN_triplet_training_checkpoint_265.h5. Also, how does this model handle noise?

@philipperemy
Owner

@PhilipAmadasun the model was trained on the LibriSpeech dataset: https://www.openslr.org/12.

The utterances were (mostly) given without noise and were short (up to 10 seconds if I remember well).

I forget the details, but some wav files were cut to fit the model.

So what you can do is:

  • Sample the voice of multiple people in your noisy environment.
  • Remove the noise with some VAD techniques for the wav files you collect.
  • For each person you recorded, compare them with the other people you recorded and with people in the dataset (you can take a few wav files from each person; train-clean-100 is a great start). What I mean by comparing is: if you pick two wavs from the same person and run the score, they should be closer than any wav from a different person.
  • You can make a Python script for that (a rough sketch follows this list). Then you can tune it by slicing the wav files and tuning the hyperparameters (how many wavs to average for a single person to produce a score, the length of each wav, etc.).
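
For reference, a rough sketch of such a script, loosely based on the usage example in this repo's README (the file paths are placeholders; a same-speaker pair should score noticeably higher than a different-speaker pair):

    import numpy as np
    from deep_speaker.audio import read_mfcc
    from deep_speaker.batcher import sample_from_mfcc
    from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
    from deep_speaker.conv_models import DeepSpeakerModel
    from deep_speaker.test import batch_cosine_similarity

    model = DeepSpeakerModel()
    model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)

    def embed(wav_path):
        # One ~1.6 second window -> one embedding vector of shape (1, emb_dim).
        mfcc = sample_from_mfcc(read_mfcc(wav_path, SAMPLE_RATE), NUM_FRAMES)
        return model.m.predict(np.expand_dims(mfcc, axis=0))

    same_a = embed('speaker1/utterance_a.wav')   # placeholder paths
    same_b = embed('speaker1/utterance_b.wav')
    other = embed('speaker2/utterance_a.wav')

    print('same speaker  :', batch_cosine_similarity(same_a, same_b))
    print('different pair:', batch_cosine_similarity(same_a, other))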

@PhilipAmadasun
Author

@philipperemy Just to make sure I understand this particular aspect of my inquiry: the length of the .wav files the embeddings are extracted from does not matter?

For example, from the same person, I compare the embedding of a 20 second recording of their voice with the embedding extracted from a 10 minute recording of their voice and get a 0.8 probability match. Then I compare embeddings from the 20 second recording with a 5 second recording, then a 2 minute recording. I should still get around 0.8 probability matches (ideally)?

If this is indeed the case, then the lengths of the audio don't matter; and if the length doesn't matter, then what properties of the audio files do matter (besides noise)?

If the same person talks with a higher pitch in one .wav file than in the other, would there still be a strong probability match between the two? Most likely not, right? I'm just trying to figure out which properties of the voice/.wav file actually matter for batch_cosine_similarity to give a strong cosine similarity match. I think this would help me figure out my other lines of questioning. I hope this question makes sense?

@philipperemy
Owner

philipperemy commented Jan 9, 2024

@PhilipAmadasun Yes it makes sense. I checked the code.

The model was trained on samples of 1.6 seconds (clear speech).

If you want the most robust result for inference, for any speaker, you should sample many wav segments of 1.6 seconds and you should average them. This will be the speaker vector.

The way to test it, would be:

  • Let's say you have 2 different speakers, each one speaking for 10 minutes. You have 2 wav files.
  • Cut each wav file in half. 2 segments of 5 minutes each. You now have speaker1/1.wav, speaker1/2.wav, speaker2/1.wav, speaker2/2.wav
  • Sample many segments of 1.6 seconds from speaker1/1.wav and speaker2/1.wav. Run the model on each of them and average them for each speaker. You will have 2 vectors speaker1_vector_1 and speaker2_vector_1.
  • Do exactly the same with the 2.wav's. You will have 2 additional vectors, speaker1_vector_2 and speaker2_vector_2.
  • Now you can compute the cosine distance to make sure that dist(speaker1_vector_1, speaker1_vector_2) << dist(speaker1_vector_1, speaker2_vector_1) - in other words, the 2 vectors of the same speaker should have a much lower distance than with any vector of the other speaker. (A rough sketch of this procedure follows below.)

https://github.com/philipperemy/deep-speaker/blob/master/deep_speaker/constants.py#L17C1-L17C88
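
A minimal sketch of that procedure (the segment count, file paths and the speaker_vector helper are illustrative assumptions; as far as I can tell, sample_from_mfcc picks a random NUM_FRAMES window of roughly 1.6 seconds, so calling it repeatedly yields many short segments):

    import numpy as np
    from deep_speaker.audio import read_mfcc
    from deep_speaker.batcher import sample_from_mfcc
    from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
    from deep_speaker.conv_models import DeepSpeakerModel

    model = DeepSpeakerModel()
    model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)

    def speaker_vector(wav_path, n_segments=50):
        # Embed many ~1.6 s windows of the same file and average them element-wise.
        mfcc = read_mfcc(wav_path, SAMPLE_RATE)
        embeddings = [model.m.predict(np.expand_dims(sample_from_mfcc(mfcc, NUM_FRAMES), axis=0))[0]
                      for _ in range(n_segments)]
        return np.mean(embeddings, axis=0)

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    s1_v1 = speaker_vector('speaker1/1.wav')
    s1_v2 = speaker_vector('speaker1/2.wav')
    s2_v1 = speaker_vector('speaker2/1.wav')

    # The same-speaker pair should be much more similar than the cross-speaker pair:
    print(cosine_similarity(s1_v1, s1_v2), '>>', cosine_similarity(s1_v1, s2_v1))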

To answer your questions:

The length of the .wav files the embeddings are extracted from does not matter?

It does. cf. my answers below.

For example, from the same person, I compare the embedding of a 20 second recording of their voice with the embedding extracted from a 10 minute recording of their voice and get a 0.8 probability match. Then I compare embeddings from the 20 second recording with a 5 second recording, then a 2 minute recording. I should still get around 0.8 probability matches (ideally)?

Yes, ideally, but you should make the comparisons on the same segment length, which is around 1-2 seconds. If the recording is 1 minute, you can sample 60 files of 1 second and average them. If the recording is 20 seconds, you can sample 20 times. Then you can compare the vector of the 1 minute recording with the vector of the 20 second recording.

If this is indeed the case, then the lengths of the audio don't matter; and if the length doesn't matter, then what properties of the audio files do matter (besides noise)?

Yes, it does matter, because the longer the recording is, the more stable the speaker vector should be. Indeed, the more files we average, the more consistent the vector estimate should be.

If the same person talks with a higher pitch in one .wav file than in the other, would there still be a strong probability match between the two? Most likely not, right? I'm just trying to figure out which properties of the voice/.wav file actually matter for batch_cosine_similarity to give a strong cosine similarity match. I think this would help me figure out my other lines of questioning. I hope this question makes sense?

Most likely not, I'd say. That's why averaging across multiple recordings might be the best way to really capture the voice properties of the speaker.

Also make sure you use wav files with a sampling rate of 16,000 Hz.
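
(For example, a quick check-and-resample step; this sketch assumes librosa and soundfile are installed, which is outside this repo's own API:)

    import librosa
    import soundfile as sf

    audio, sr = librosa.load('my_recording.wav', sr=None)   # sr=None keeps the original rate
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
        sf.write('my_recording_16k.wav', audio, 16000)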

@PhilipAmadasun
Author

PhilipAmadasun commented Jan 14, 2024

@philipperemy When you say "averaging", do you literally mean element-wise averaging? On a different note, how do I make sure deep-speaker has CUDA access? Is there a way of knowing?
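
(Side note: since deep-speaker runs on TensorFlow, one generic way to check is to list the devices TensorFlow can see; this is plain TensorFlow, not a deep-speaker API:)

    import tensorflow as tf

    print(tf.config.list_physical_devices('GPU'))   # an empty list means CPU-only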

@philipperemy
Owner

@PhilipAmadasun
Author

@philipperemy I have some issues with tensorflow, so I'm gonna create another issue for it. Please still keep this issue open as I do my tests.

@philipperemy
Owner

okay cool.

@PhilipAmadasun
Author

@philipperemy I might place this inquiry in the new issue I raised, but I thought I would briefly ask here. Is it possible for tensorflow to be replaced with straight pytorch for deep-speaker? Or was there some specific reason you used tensorflow?

@philipperemy
Owner

philipperemy commented Jan 20, 2024

It would require a lot of work to port it to pytorch, so I'd say it's not possible. At that time, pytorch did not exist yet, I guess lol

@PhilipAmadasun
Author

@philipperemy Is there a chance that you are working on a way for deep-speaker to handle simultaneous cosine similarity calculations? As in, let's say the user wants to compare a voice embedding with several saved voice embeddings at once. Does my question make sense?

@philipperemy
Owner

philipperemy commented Jan 26, 2024

@PhilipAmadasun Oh I see. You just need to compute them one by one and average the result. If your user is x and you want to compare with y1, y2, y3, you just do

np.mean([batch_cosine_similarity(x, y1), batch_cosine_similarity(x, y2), batch_cosine_similarity(x, y3)])

@PhilipAmadasun
Author

PhilipAmadasun commented Jan 26, 2024

@philipperemy If I want to see whether embedding x matches any of the saved embeddings y1, y2, or y3, you're saying I should use np.mean([batch_cosine_similarity(x, y1), batch_cosine_similarity(x, y2), batch_cosine_similarity(x, y3)])? I'm not sure how that makes sense. Shouldn't I compare x to the saved embeddings individually, then choose which comparisons pass some probability threshold?

@philipperemy
Owner

@PhilipAmadasun Oh yeah, there are a lot of ways to do that. What you're saying makes sense. But imagine if you have something like 10,000 y_i: if you take the max() instead of the mean(), you will for sure find one y_i that has a high cosine similarity, but that could be just an artifact. I don't have a strong opinion on what the best way would be. You have to try multiple methods and see which one works best for your use case.
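
For illustration, one way to score a query embedding against many stored embeddings in a single operation and then apply a threshold (the threshold value and the names here are just assumptions to sketch the idea):

    import numpy as np

    def rank_matches(x, saved, threshold=0.7):
        # x: (emb_dim,) query embedding; saved: (n_speakers, emb_dim) stored embeddings.
        x = x / np.linalg.norm(x)
        saved = saved / np.linalg.norm(saved, axis=1, keepdims=True)
        sims = saved @ x                    # cosine similarity against every stored vector
        order = np.argsort(sims)[::-1]      # best match first
        return [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]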

@PhilipAmadasun
Author

@philipperemy I'll look into this more. By the way:

np.random.seed(123)
random.seed(123)

Are these important for anything? Do they somehow help comparisons?

@philipperemy
Owner

@PhilipAmadasun Not really, because tensorflow on a GPU does not ensure that the calculations will be exactly the same. So it's pretty useless actually.

@PhilipAmadasun
Author

PhilipAmadasun commented Feb 1, 2024

@philipperemy If I wanted deep-speaker to recognize my voice, I suppose it would be better to have saved (and averaged) embeddings of my voice in various situations? So one embedding of me yelling, one of me at a higher pitch, one of me a little farther from the microphone than usual, etc.? It doesn't seem like just one saved embedding would cut it, but I don't know. Using your averaging technique has slightly improved things, but I am unable to get a cosine similarity higher than 0.6.
For context, this is my setup:

  • I have individual two minute clips of clear audio (well, as clear as I could get) of myself and 3 other people.

  • I obtained an average embedding (breaking the audio into 1.6 second chunks) for each person and saved these embeddings.

  • For any audio I use for comparison, I obtain an average embedding in the same way and compare it with the saved embeddings.

NOTE: I have started to look at the code base and it seems like it can actually be improved. For instance, in audio.py:

    # right_blank_duration_ms = (1000.0 * (len(audio) - offsets[-1])) // self.sample_rate
    # TODO: could use trim_silence() here or a better VAD.
    audio_voice_only = audio[offsets[0]:offsets[-1]]
    mfcc = mfcc_fbank(audio_voice_only, sample_rate)

It seems like it could be modified to use webrtcvad instead, no?
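
A hedged sketch of that idea, replacing the offset-based trim with webrtcvad (the frame length and aggressiveness mode are assumptions; the input is assumed to be float PCM at 16,000 Hz):

    import numpy as np
    import webrtcvad

    def trim_non_speech(audio, sample_rate=16000, frame_ms=30, mode=2):
        # Keep only the 30 ms frames that the VAD classifies as speech.
        vad = webrtcvad.Vad(mode)                          # 0 = least, 3 = most aggressive
        pcm16 = (audio * 32768).clip(-32768, 32767).astype(np.int16)
        frame_len = int(sample_rate * frame_ms / 1000)
        voiced = []
        for start in range(0, len(pcm16) - frame_len + 1, frame_len):
            frame = pcm16[start:start + frame_len]
            if vad.is_speech(frame.tobytes(), sample_rate):
                voiced.append(audio[start:start + frame_len])
        return np.concatenate(voiced) if voiced else audio

    # audio_voice_only = trim_non_speech(audio, sample_rate)
    # mfcc = mfcc_fbank(audio_voice_only, sample_rate)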

Second note: I'm thinking about this averaging method and don't really understand the logic behind it, most likely because I don't understand vector operations well. I think this actually ties in to what I asked earlier as well. I don't think the averaged vector estimate of someone's voice taken while they are in a neutral emotional state would match well with the vector estimate of that same person's voice when they are in a heightened emotional state. That is just one scenario I thought of that wouldn't be favorable to this method. I don't even know what method would work for such scenarios, when I'm not getting favorable results in more controlled scenarios.
