Inquiry on embedding extractions for voice comparisons #105
@PhilipAmadasun the model was trained on the LibriSpeech dataset: https://www.openslr.org/12. The utterances were (mostly) free of noise and short (up to 10 seconds if I remember well). I forget the details, but some wav files were cut to fit the model. So what you can do is:
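On the preprocessing side, a minimal sketch of bringing a recording to the 16,000 Hz mono format the model expects before extracting embeddings; librosa/soundfile and the file names are assumptions for illustration, not necessarily the procedure that was originally suggested here:

```python
import librosa
import soundfile as sf

# Load and resample to 16 kHz mono (the rate the pre-trained model was trained on),
# then write the converted file back out for embedding extraction.
audio, sr = librosa.load('input_recording.wav', sr=16000, mono=True)
sf.write('input_recording_16k.wav', audio, sr)
```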
@philipperemy Just to make sure I understand this particular aspect of my inquiry: the length of the .wav files the embeddings are extracted from does not matter? For example, from the same person, if I compare the embedding of a 20 second recording of their voice with the embedding extracted from a 10 minute recording of their voice, do I get a strong match? If this is indeed the case then the lengths of the audio don't matter, and if the length doesn't matter, then what properties of the audio files matter (besides noise)? If the same person talks with a higher pitch in one .wav file than in the other, would there still be a strong probability match between the two? Most likely not, right? I'm just trying to figure out which properties of the voice/.wav file actually matter for batch_cosine_similarity to give a strong cosine similarity match. I think this would help me figure out my other lines of questioning. I hope this question makes sense?
@PhilipAmadasun Yes it makes sense. I checked the code. The model was trained on samples of 1.6 seconds (clear speech). If you want the most robust result for inference, for any speaker, you should sample many wav segments of 1.6 seconds and average them. This will be the speaker vector. The way to test it would be:
https://github.com/philipperemy/deep-speaker/blob/master/deep_speaker/constants.py#L17C1-L17C88 To answer your questions:
It does. cf. my answers below.
Yes ideally but you should make the comparisons on the same segment length which is around 1~2 seconds. If the recording is 1min, you can sample 60 files of 1s and average them. If the recording is 20s you can sample 20 times. And then you can compare the vector of the 1min with the vector of the 20s.
Yes it does matter because the longer the recording is, the more stable the speaker vector should be. Indeed, the more files we average, the more consistent the vector estimation should be.
Most likely not, I'd say. That's why averaging across multiple recordings might be the best way to really capture the voice properties of the speaker. Also make sure you use wav files with a sampling rate of 16,000 Hz.
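A rough sketch of this sampling-and-averaging procedure, assuming the helper functions shown in the repo's README (read_mfcc, sample_from_mfcc, DeepSpeakerModel, batch_cosine_similarity) and the checkpoint file named in this thread; the file names and segment counts are made up, and the final re-normalization is an extra precaution rather than something prescribed here:

```python
import numpy as np

from deep_speaker.audio import read_mfcc
from deep_speaker.batcher import sample_from_mfcc
from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
from deep_speaker.conv_models import DeepSpeakerModel
from deep_speaker.test import batch_cosine_similarity

# Pre-trained weights discussed in this issue.
model = DeepSpeakerModel()
model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)

def speaker_vector(wav_path, num_segments=60):
    """Average the embeddings of `num_segments` random ~1.6 s crops of one recording."""
    mfcc = read_mfcc(wav_path, SAMPLE_RATE)
    # sample_from_mfcc takes a random NUM_FRAMES window, so each call yields a new crop.
    crops = np.array([sample_from_mfcc(mfcc, NUM_FRAMES) for _ in range(num_segments)])
    embeddings = model.m.predict(crops)                      # one embedding per crop
    mean_vec = np.mean(embeddings, axis=0, keepdims=True)    # element-wise average
    return mean_vec / np.linalg.norm(mean_vec)               # re-normalize for the dot-product similarity

# Hypothetical recordings of the same speaker: a longer and a shorter one.
v_long = speaker_vector('speaker_a_long.wav', num_segments=60)
v_short = speaker_vector('speaker_a_short.wav', num_segments=20)
print('same-speaker similarity:', batch_cosine_similarity(v_long, v_short))
```

The re-normalization is there because averaging unit-length embeddings shrinks the vector slightly, and the similarity here is a plain dot product that assumes normalized inputs.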
@philipperemy When you say "averaging", do you literally mean element-wise averaging? On a different note, how do I make sure deep-speaker has CUDA access? Is there a way of knowing?
@PhilipAmadasun yeah, a simple element-wise average. For the GPU, it relies on keras/tensorflow: https://stackoverflow.com/questions/38009682/how-to-tell-if-tensorflow-is-using-gpu-acceleration-from-inside-python-shell.
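A quick way to check, assuming a TensorFlow 2.x install (this is plain TensorFlow, not a deep-speaker API):

```python
import tensorflow as tf

# A non-empty list means TensorFlow (and therefore deep-speaker) can see a CUDA GPU.
print(tf.config.list_physical_devices('GPU'))
```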
@philipperemy I have some issues with tensorflow, so I'm gonna create another issue for it. Please still keep this issue open as I do my tests.
okay cool.
@philipperemy I might place this inquiry in the new issue I raised, but I thought I would briefly ask here. Is it possible for tensorflow to be replaced with straight pytorch for deep-speaker? Or was there some specific reason you used tensorflow?
It would require a lot of work to port it to pytorch, so I'd say it's not possible. At that time, pytorch did not exist I guess lol
@philipperemy Is there a chance that you are working on a way for deep-speaker to handle simultaneous cosine similarity calculations? As in, let's say the user wants to compare a voice embedding with several saved voice embeddings at once. Does my question make sense?
@PhilipAmadasun Oh I see. You just need to compute them one by one and average the result. If your user is
@philipperemy If I want to see if embedding x matches any one of several saved embeddings y_i, could I take the max() of the cosine similarities instead of the mean()?
@PhilipAmadasun oh yeah, there are a lot of ways to do that. What you're saying makes sense. But imagine if you have like 10,000 y_i: if you take the max() instead of mean(), you will for sure find one y_i that has a high cosine similarity. But that could just be an artifact. I don't have a strong idea of what would be the best way. You have to try multiple methods and see which one works best for your use case.
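To make the trade-off concrete, a small sketch in plain NumPy (not a deep-speaker API; the query and stored arrays are random placeholders standing in for real embeddings) that scores one query embedding against many stored y_i and reports both the mean and the max:

```python
import numpy as np

# Placeholder data: one query embedding and N stored speaker embeddings (512-dim assumed).
rng = np.random.default_rng(0)
query = rng.normal(size=(1, 512))
stored = rng.normal(size=(10_000, 512))  # the y_i from the discussion

def cosine_scores(q, y):
    # Normalize rows, then a single matrix product gives every cosine similarity at once.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return (y @ q.T).ravel()  # shape (N,)

scores = cosine_scores(query, stored)
# mean() is conservative; max() will almost always find *some* high-scoring y_i by chance,
# which is the artifact risk mentioned above.
print('mean:', scores.mean(), 'max:', scores.max(), 'best index:', scores.argmax())
```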
@philipperemy I'll look into this more. By the way:
Are these important for anything? Do they somehow help comparisons?
@PhilipAmadasun Not really, because tensorflow on a GPU does not ensure that the calculations will be exactly the same. So it's pretty useless actually.
@philipperemy If I wanted deep-speaker to recognize my voice, I suppose it would be better to have saved (and averaged-out) embeddings of my voice in various situations? So one embedding of me yelling, one of me at a higher pitch, one of me a little further from the microphone than usual, etc.? It doesn't seem like just one saved embedding would cut it, but I don't know. Using your averaging technique has slightly improved things, but I am unable to get a cosine similarity higher than 0.6.
NOTE: I have started to look at the code base and it seems like it can actually be improved. For instance, one part of it looks like it could be modified to use a different implementation.
Second note: I'm thinking about this averaging method and I don't really understand the logic behind it, most likely because I don't understand vector operations well. I think this actually ties in to what I asked earlier as well. I don't think the averaged-out vector estimation of someone's voice, taken while they are in a neutral emotional state, would match well with the vector estimation of that same person's voice when they are in a heightened emotional state. This is just one scenario I thought of that wouldn't be favorable to this method, and I don't even know what method would work for such scenarios when I'm not getting favorable results in more controlled ones.
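One way to experiment with the multi-condition idea, sketched under the assumption that the speaker_vector() helper from the earlier sketch is available; the per-condition file names are made up and this is not a recommendation from the thread:

```python
from deep_speaker.test import batch_cosine_similarity

# One averaged vector per recording condition (each built with speaker_vector() above),
# instead of collapsing every condition into a single vector.
enrolled = {
    'neutral': speaker_vector('me_neutral.wav'),
    'yelling': speaker_vector('me_yelling.wav'),
    'high_pitch': speaker_vector('me_high_pitch.wav'),
    'far_from_mic': speaker_vector('me_far_from_mic.wav'),
}

# New utterance recorded in an unknown condition.
test = speaker_vector('me_unknown.wav', num_segments=10)

# Score against each enrolled condition; the max shows the closest condition,
# while the mean gives a more conservative overall match.
scores = {name: float(batch_cosine_similarity(vec, test)[0]) for name, vec in enrolled.items()}
print(scores)
print('closest condition:', max(scores, key=scores.get))
```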
If I want my program to recognize the voice of someone whose embedding I've already stored, is it better for the stored embedding to be extracted from a short or a long .wav of the person speaking, so that the model has an easier time identifying the voice correctly (at least to a 0.75 to 0.8 probability match)? Or does the length not matter to some extent (for instance, a 2 minute .wav file versus a 5 minute or 10 minute .wav file)? I want to compare the stored embedding with an embedding of the person speaking for, say, 5 seconds, 10 seconds, or longer. I'm using the pre-trained model ResCNN_triplet_training_checkpoint_265.h5. Also, how does this model handle noise?