Speech-to-Speech (Zero-Shot Voice Conversion) #82

fakerybakery · 2023-11-22T23:51:59Z

Is it possible to implement speech-to-speech somewhat similar to this?

Here's an image of how they do it from the website:

Thanks!

yl4579 · 2023-11-23T02:20:24Z

Maintainer

I think what you mean is voice conversion. I have done this for StyleTTS, and it should work for StyleTTS 2 as well; see https://github.com/yl4579/StyleTTS-VC. The idea is to align the text with the input mel-spectrograms, then use the aligned phonemes, the F0 and energy of the input, and a different speaker embedding to reconstruct the speech. The alignment and text can be replaced with some text encoder, as shown in StyleTTS-VC.
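
To make the data flow in the comment above concrete, here is a minimal sketch of how the decoder inputs could be assembled. It is not code from StyleTTS or StyleTTS-VC: `DecoderInput`, `expand_by_duration`, and `build_decoder_input` are hypothetical names, and the alignment is assumed to be available as one frame count per phoneme. The point is only that content, F0, and energy come from the source utterance, while the speaker embedding comes from the target.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DecoderInput:
    """Frame-level inputs for a hypothetical StyleTTS-like decoder."""
    content: List[str]    # one phoneme label per mel frame (from the source)
    f0: List[float]       # pitch contour of the source utterance
    energy: List[float]   # energy contour of the source utterance
    speaker: List[float]  # embedding of the *target* speaker


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for as many frames as the alignment assigns to it."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


def build_decoder_input(
    phonemes: Sequence[str],
    durations: Sequence[int],
    source_f0: Sequence[float],
    source_energy: Sequence[float],
    target_speaker: Sequence[float],
) -> DecoderInput:
    """Combine source content/prosody with a different speaker embedding."""
    content = expand_by_duration(phonemes, durations)
    if not (len(content) == len(source_f0) == len(source_energy)):
        raise ValueError("content, F0 and energy must cover the same frames")
    return DecoderInput(content, list(source_f0), list(source_energy), list(target_speaker))


# Toy usage: three phonemes aligned to six frames, converted to another speaker.
example = build_decoder_input(
    phonemes=["HH", "AH", "L"],
    durations=[2, 3, 1],
    source_f0=[0.0, 110.0, 112.0, 115.0, 118.0, 0.0],
    source_energy=[0.1, 0.6, 0.7, 0.7, 0.6, 0.2],
    target_speaker=[0.5, -0.2],  # placeholder values, not a real embedding
)
```

A real system would pass tensors rather than Python lists, and the decoder itself is left out here.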

yl4579 · 2023-11-24T10:29:48Z

Maintainer

I think this is interesting though, so if someone wants to apply the idea of StyleTTS-VC to StyleTTS 2 and use encoders like https://github.com/auspicious3000/contentvec to better disentangle the speaker information, it'd be greatly appreciated. Unfortunately I don't have the time to do this right now.

yl4579 · 2023-11-24T10:42:10Z

Maintainer

The basic idea would be to train an acoustic model using train_first.py (one can probably start from the pre-trained LibriTTS model, but it needs some alignment because of the different hop size and sampling rate), fine-tune contentvec to reconstruct speech while keeping the decoder fixed, then fine-tune the decoder while keeping contentvec fixed. This process can even be repeated iteratively, and I believe it can produce better results than RVC (SoVITS). If anyone is interested, please email me at yl4579@columbia.edu or reply to this thread for further discussion.
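
The hop size and sampling rate mismatch mentioned above comes down to the two models producing frames at different rates. The sketch below shows one simple way to bring a feature sequence onto another frame grid, using nearest-neighbour resampling. The concrete numbers are assumptions to be checked against the actual configs: 24 kHz with a hop of 300 samples for the StyleTTS 2 LibriTTS model, and 16 kHz with a stride of 320 samples for contentvec. `frame_rate` and `resample_frames` are hypothetical helper names, and linear interpolation would be an equally reasonable choice.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def frame_rate(sample_rate: int, hop_size: int) -> float:
    """Frames per second produced by a front end with this hop size."""
    return sample_rate / hop_size


def resample_frames(features: Sequence[T], src_rate: float, dst_rate: float) -> List[T]:
    """Nearest-neighbour resampling of a frame sequence to a new frame rate."""
    if not features:
        return []
    n_out = round(len(features) * dst_rate / src_rate)
    out: List[T] = []
    for i in range(n_out):
        # Map the centre of output frame i back to a source frame index.
        src_idx = int((i + 0.5) * src_rate / dst_rate)
        out.append(features[min(src_idx, len(features) - 1)])
    return out


# Assumed settings (verify against the real configs before relying on them).
decoder_rate = frame_rate(24000, 300)   # assumed StyleTTS 2 LibriTTS mel frames
encoder_rate = frame_rate(16000, 320)   # assumed contentvec feature frames

# Five encoder frames (0.1 s) stretched onto the decoder's frame grid.
aligned = resample_frames(["c0", "c1", "c2", "c3", "c4"], encoder_rate, decoder_rate)
```

Under these assumed settings the encoder produces 50 frames per second and the decoder expects 80, so every encoder frame is repeated once or twice.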

3 replies

AWAS666 Nov 26, 2023

So as you mentioned RVC, do you think singing conversion could be possible too?

Cheneng Nov 27, 2023

Why do you believe that contentvec + AdaIN gives better results than contentvec + VITS, especially since the latter is the more end-to-end approach?

yl4579 Nov 27, 2023
Maintainer

@Cheneng the latter is not more end-to-end because, as far as I know, it doesn’t update the speech encoder. It extracts the SSL features directly as input without fine-tuning the encoder. StyleTTS-VC trains both the encoder and the decoder, while SoVITS only trains the decoder.
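
The difference between the two training regimes discussed in this thread can be illustrated with a deliberately tiny toy, where the "encoder" and "decoder" are single scalar weights and the loss is a squared reconstruction error. This is only a sketch of the freezing schedule (encoder with the decoder fixed, then decoder with the encoder fixed, repeated), not of either real system, and it says nothing about output quality. All names and hyperparameters here are illustrative assumptions.

```python
from typing import List, Tuple


def loss(enc: float, dec: float, xs: List[float], ys: List[float]) -> float:
    """Mean squared reconstruction error of the toy model dec * (enc * x)."""
    return sum((dec * enc * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(
    enc: float,
    dec: float,
    xs: List[float],
    ys: List[float],
    *,
    update_enc: bool,
    update_dec: bool,
    steps: int,
    lr: float = 0.002,
) -> Tuple[float, float]:
    """Plain gradient descent; a weight with its flag off stays frozen."""
    n = len(xs)
    for _ in range(steps):
        grad_enc = sum(2 * (dec * enc * x - y) * dec * x for x, y in zip(xs, ys)) / n
        grad_dec = sum(2 * (dec * enc * x - y) * enc * x for x, y in zip(xs, ys)) / n
        if update_enc:
            enc -= lr * grad_enc
        if update_dec:
            dec -= lr * grad_dec
    return enc, dec


xs = [1.0, 2.0, 3.0]
ys = [6.0 * x for x in xs]  # toy target: the product enc * dec should reach 6

# Decoder-only regime: the encoder is used as-is and never updated.
enc_frozen, dec_only = train(1.0, 1.0, xs, ys, update_enc=False, update_dec=True, steps=150)

# Alternating regime: encoder phase with the decoder fixed, then the reverse.
enc_alt, dec_alt = 1.0, 1.0
for _ in range(3):
    enc_alt, dec_alt = train(enc_alt, dec_alt, xs, ys, update_enc=True, update_dec=False, steps=25)
    enc_alt, dec_alt = train(enc_alt, dec_alt, xs, ys, update_enc=False, update_dec=True, steps=25)
```

In the first regime `enc_frozen` is still exactly its initial value afterwards, while in the second both weights have moved. In a real framework the same effect is usually achieved by excluding the frozen module's parameters from the optimizer or disabling their gradients.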
