Speech-to-Speech (Zero-Shot Voice Conversion) #82
Replies: 3 comments 3 replies
-
I think what you mean is voice conversion. I have done this for StyleTTS, and it should work for StyleTTS 2 as well. See https://github.com/yl4579/StyleTTS-VC. The idea is to align the text with the input mel-spectrogram, then use the aligned phonemes, F0, and energy of the input together with a different speaker embedding to reconstruct the speech. The alignment and text can be replaced with some text encoder, as shown in StyleTTS-VC.
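To make the data flow concrete, here is a minimal, framework-free sketch of how the decoder inputs could be assembled. It is not the StyleTTS or StyleTTS-VC code: the function names, the hard duration-based alignment, and the use of the L2 norm over mel bins as "energy" are all assumptions made for illustration. The point it shows is that content and prosody come from the source utterance while the speaker vector comes from the target.

```python
from math import sqrt


def expand_by_alignment(phoneme_feats, durations):
    """Repeat each phoneme vector once per frame it is aligned to (hard alignment)."""
    frames = []
    for feat, n_frames in zip(phoneme_feats, durations):
        frames.extend(list(feat) for _ in range(n_frames))
    return frames


def frame_energy(mel_frames):
    """Per-frame energy, taken here as the L2 norm over mel bins (an assumption)."""
    return [sqrt(sum(v * v for v in frame)) for frame in mel_frames]


def build_decoder_inputs(phoneme_feats, durations, f0, mel_frames, target_style):
    """Combine source content and prosody with a *different* speaker embedding.

    phoneme_feats: one feature vector per phoneme of the source utterance
    durations:     number of mel frames aligned to each phoneme
    f0:            per-frame pitch of the source utterance
    mel_frames:    per-frame mel vectors of the source utterance
    target_style:  embedding of the speaker to convert to (hypothetical name)
    """
    aligned = expand_by_alignment(phoneme_feats, durations)
    energy = frame_energy(mel_frames)
    if not (len(aligned) == len(f0) == len(energy)):
        raise ValueError("alignment, F0, and mel must cover the same number of frames")
    # One row per frame: [content | F0 | energy | target speaker embedding]
    return [
        content + [pitch, e] + list(target_style)
        for content, pitch, e in zip(aligned, f0, energy)
    ]
```

In a real model these per-frame rows would go to a trained decoder rather than being concatenated as plain lists, and the speaker embedding would typically come from a style encoder run on a reference clip of the target speaker.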
-
I think this is interesting though, so if someone wants to apply the idea of StyleTTS-VC to StyleTTS 2 and use encoders like https://github.com/auspicious3000/contentvec to better disentangle the speaker information, it'd be greatly appreciated. Unfortunately I don't have the time to do this right now.
-
The basic idea would be to train an acoustic model using
-
Is it possible to implement speech-to-speech somewhat similar to this?
Here's an image of how they do it from the website:
Thanks!