Need help configuring multi-language, multi-speaker VITS training #1590
Replies: 2 comments
-
You can define a language code per dataset in the dataset config, and the phonemizer will then use a separate G2P backend for each language, as long as the language is supported by one of our default backends. Check which backends support your languages.
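Something along these lines (a minimal sketch; exact field names can differ between versions, e.g. older releases use `name` instead of `formatter`, and the paths below are placeholders):

```python
from TTS.tts.configs.shared_configs import BaseDatasetConfig

# One dataset entry per language; the `language` code tells the phonemizer
# which G2P backend to use for that dataset.
dataset_en = BaseDatasetConfig(
    formatter="vctk",            # built-in formatter matching the VCTK layout
    path="/data/VCTK",           # placeholder path
    meta_file_train="",
    language="en",
)
dataset_ja = BaseDatasetConfig(
    formatter="ljspeech",        # any formatter matching your metadata layout
    path="/data/ja_corpus",      # placeholder path
    meta_file_train="metadata.csv",
    language="ja",
)

# Pass all of them to the training config via its `datasets` field.
datasets = [dataset_en, dataset_ja]
```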
`characters` should contain all the characters you need across all your datasets. Punctuation is handled differently depending on the G2P backend used, so it is important to keep punctuation separate from the characters. If you set `use_phonemes` to False, the model is trained directly on the graphemes. You can also override any of the classes as you wish, based on your requirements; you don't have to use the default classes.
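For example (just a sketch; the character and punctuation sets below are placeholders, and the defaults may differ between versions):

```python
from TTS.tts.configs.shared_configs import CharactersConfig

characters_config = CharactersConfig(
    # point this to your own class if you need custom behaviour
    characters_class="TTS.tts.utils.text.characters.Graphemes",
    pad="<PAD>",
    eos="<EOS>",
    bos="<BOS>",
    blank="<BLNK>",
    # every grapheme that appears in any of your datasets (placeholder set)
    characters="abcdefghijklmnopqrstuvwxyzáéíóúñü",
    # keep punctuation separate so the G2P backend can handle it its own way
    punctuations="!¡'(),-.:;¿? ",
    # leave empty when training on graphemes (use_phonemes=False)
    phonemes="",
)
```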
More is always better. We've not observed any performance problems. It should be fine to train a single model with all the languages you have.
-
@erogol What if I want to use phonemes for some languages and graphemes for the others? How can I configure that?
-
I already have some experience fine-tuning a monolingual multi-speaker VITS model (specifically "en/vctk/vits") with Coqui. That was fairly straightforward, since both datasets used the same single language, so adapting the config mostly came down to changing a few variables to point to the new dataset files. I'm now considering expanding the same model to multiple languages, and things suddenly get a lot more confusing for someone new to the library (and to TTS models in general), so I'm hoping someone can point me in the right direction.
Let's say I want to train a single model on three very different languages: English, Spanish and Japanese. Each language would be trained on multiple speakers. I want to use the VITS model because it's end-to-end, seems to produce the best-sounding results in my experimentation, and I already have significant training progress on a monolingual variant of it that I could presumably fine-tune/resume training from. This should be possible, given that YourTTS exists, is a variant of VITS, and is multi-language + multi-speaker. However, I should stress that I don't want to resume training from YourTTS, since that variant seems to have been deliberately crippled by the authors due to some ethical reservations of theirs that remain unresolved after months, and I don't know whether that limitation applies only to zero-shot voice cloning or to the model's synthesis capabilities as a whole, in which case it would propagate to any of my own models trained on top of it. I assume I'd also lose my current progress with the single-language VITS model I was training.
As for specific questions...
Phonemizers. When do/don't I want to use them in general, and should I in my specific case, if I even can? From the config files and my limited understanding, it seems like they're language-specific, and a single config only supports a single phonemizer and its associated language; for example, the VCTK "en/vctk/vits" model I resumed from used the "espeak" phonemizer with a "phoneme_language" of "en". Since I now have three languages ("en", "es", "ja"), do I simply not use a phonemizer, i.e. set "use_phonemes" to False? What are the potential implications of that in terms of synthesis quality and training time?
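For reference, this is roughly what I imagine the grapheme-only route would look like, based on my reading of the config classes (a rough sketch; I may well have the fields or values wrong):

```python
from TTS.tts.configs.vits_config import VitsConfig

# Sketch of the phonemizer-related fields as I understand them; the values below
# are my guess for a grapheme-only multi-language run, not something I've verified.
config = VitsConfig(
    use_phonemes=False,       # train directly on graphemes instead of phonemes?
    phonemizer=None,          # was "espeak" in the en/vctk/vits config I resumed from
    phoneme_language=None,    # was "en" in that monolingual config
    phoneme_cache_path="phoneme_cache",  # presumably only used when use_phonemes=True
)
```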
Characters. I still don't fully understand the "characters" section of the config, even in a single-language context. Presumably I want to make sure all characters that correspond to spoken sounds are listed in the "characters" subsection, but does the "punctuations" subsection matter, and if so, how? What impact does it have on training/synthesis (if any), and which symbols should be included and which discarded? As for the "phonemes" subsection, do I just set it to an empty string if not using a phonemizer, or are the two settings unrelated?
Adapting this to include Spanish is fairly straightforward; it's basically the same as English, plus variants of a handful of the letters. But when you consider Japanese, things start getting very... interesting. It's a language that can use four different scripts (kanji, hiragana, katakana, Latin) within the same sentence, there are no spaces between words, different sequences of characters can have the same reading, a single sequence of characters can have multiple different readings, and the total character set numbers in the thousands. What's the proper way to define this so that the model can effectively "understand" and use it?
Intuitively, it seems like the only way to resolve this would be to manually convert all text to romaji (the Latin representation of Japanese script/sounds) and insert spaces between words in the training dataset (there's a rough sketch of what I mean at the end of this post). At first glance at the config for the only Japanese pretrained model available ("ja/kokoro/tacotron2-DDC"), this seems to be the approach taken, since its "characters" section only lists Latin characters. However, when synthesizing with this model, romaji input produces gibberish (single characters spelled out), while raw Japanese input (a mix of kanji/kana) produces proper speech, strongly suggesting the opposite: that the model was trained on raw Japanese input. This model uses "ja_jp_phonemizer" as its phonemizer, so I take it that all text-to-sound conversion is delegated to that and the entire "characters" section is ignored?

Assuming so, it's still a problem for my use case, since I can't set per-language phonemizers, and I can't use the Japanese phonemizer globally because my model also covers English and Spanish. Is there a way around this, or am I forced to train two separate models: a Japanese one and an English+Spanish one?

Also, the Japanese pretrained model uses a custom vocoder ("vocoder_models/ja/kokoro/hifigan_v1"), while the VITS model I want to train doesn't need a separate vocoder. And if I did use one, I'm again limited to a single vocoder while having multiple languages. So how do I proceed with respect to vocoders? Can I get away with simply not using one?
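For clarity, the romaji pre-processing I had in mind above would be something along these lines (a rough sketch using the third-party pykakasi library, which is not part of this repo; I don't know whether this is actually a sensible approach):

```python
# Pre-convert raw Japanese text to space-separated romaji before building the
# training metadata. pykakasi is a third-party library, not part of Coqui TTS.
import pykakasi

kks = pykakasi.kakasi()

def to_romaji(text: str) -> str:
    # convert() splits the text into segments and returns several readings per
    # segment; "hepburn" is the Latin (romaji) transliteration.
    segments = kks.convert(text)
    return " ".join(seg["hepburn"] for seg in segments if seg["hepburn"].strip())

print(to_romaji("日本語のテキストを音声に変換する"))
# roughly "nihongo no tekisuto wo onsei ni henkan suru" (segmentation may differ)
```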