Need help configuring multi-language, multi-speaker VITS training #1590
Replies: 2 comments
-
You can define a language code per dataset in the dataset config, and the phonemizer will then use a separate G2P backend for each language, as long as the language is supported by one of our default backends. Check which backends support your languages.
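Something along these lines (a minimal sketch; exact field names can differ between versions, e.g. older releases use `name` instead of `formatter`, and the paths below are placeholders):

```python
from TTS.tts.configs.shared_configs import BaseDatasetConfig

# One dataset entry per language; the `language` code tells the phonemizer
# which G2P backend to use for that dataset.
dataset_en = BaseDatasetConfig(
    formatter="vctk",            # built-in formatter matching the VCTK layout
    path="/data/VCTK",           # placeholder path
    meta_file_train="",
    language="en",
)
dataset_ja = BaseDatasetConfig(
    formatter="ljspeech",        # any formatter matching your metadata layout
    path="/data/ja_corpus",      # placeholder path
    meta_file_train="metadata.csv",
    language="ja",
)

# Pass all of them to the training config via its `datasets` field.
datasets = [dataset_en, dataset_ja]
```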
`characters` should contain all the characters you need across all your datasets. Punctuation is handled differently depending on the G2P backend used, so it is important to keep punctuation separate from the characters. If you set `use_phonemes` to False, the model is trained directly on the graphemes. You can also override any of the classes as you wish, based on your requirements; you don't have to use the default classes.
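For example (just a sketch; the character and punctuation sets below are placeholders, and the defaults may differ between versions):

```python
from TTS.tts.configs.shared_configs import CharactersConfig

characters_config = CharactersConfig(
    # point this to your own class if you need custom behaviour
    characters_class="TTS.tts.utils.text.characters.Graphemes",
    pad="<PAD>",
    eos="<EOS>",
    bos="<BOS>",
    blank="<BLNK>",
    # every grapheme that appears in any of your datasets (placeholder set)
    characters="abcdefghijklmnopqrstuvwxyzáéíóúñü",
    # keep punctuation separate so the G2P backend can handle it its own way
    punctuations="!¡'(),-.:;¿? ",
    # leave empty when training on graphemes (use_phonemes=False)
    phonemes="",
)
```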
More is always better. We've not observed any performance problems. It should be fine to train a single model with all the languages you have.
-
@erogol What if I want to use phonemes for some languages and graphemes for the others? How can I configure that?
-
I already have some experience fine-tuning a monolingual multi-speaker VITS model (specifically "en/vctk/vits") with Coqui. That was fairly straightforward, since both datasets used the same single language, so adapting the config mostly came down to changing a few variables to point to the new dataset files. I'm now considering expanding the same model to multiple languages, and things suddenly get a lot more confusing for someone new to the library (and to TTS models in general), so I'm hoping someone can point me in the right direction.
Let's say I want to train a single model on three very different languages: English, Spanish and Japanese. Each language would be trained on multiple speakers. I want to use the VITS model because it's end-to-end, seems to produce the best-sounding results in my experimentation, and I already have significant training progress on a monolingual variant of it that I could presumably fine-tune/resume training from. This should be possible, given that YourTTS exists, is a variant of VITS, and is multi-language + multi-speaker. However, I should stress that I don't want to resume training from YourTTS, since that variant seems to have been deliberately crippled by the authors due to some ethical reservations of theirs that remain unresolved after months, and I don't know whether that limitation applies only to zero-shot voice cloning or to the model's synthesis capabilities as a whole, in which case it would propagate to any of my own models trained on top of it. I assume I'd also lose my current progress with the single-language VITS model I was training.
As for specific questions...
Phonemizers. When do/don't I want to use them in general, and should I in my specific case, if I even can? From the config files and my limited understanding, it seems like they're language-specific, and a single config only supports a single phonemizer and its associated language; for example, the VCTK "en/vctk/vits" model I resumed from used the "espeak" phonemizer with a "phoneme_language" of "en". Since I now have three languages ("en", "es", "ja"), do I simply not use a phonemizer, i.e. set "use_phonemes" to False? What are the potential implications of that in terms of synthesis quality and training time?
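For reference, this is roughly what I imagine the grapheme-only route would look like, based on my reading of the config classes (a rough sketch; I may well have the fields or values wrong):

```python
from TTS.tts.configs.vits_config import VitsConfig

# Sketch of the phonemizer-related fields as I understand them; the values below
# are my guess for a grapheme-only multi-language run, not something I've verified.
config = VitsConfig(
    use_phonemes=False,       # train directly on graphemes instead of phonemes?
    phonemizer=None,          # was "espeak" in the en/vctk/vits config I resumed from
    phoneme_language=None,    # was "en" in that monolingual config
    phoneme_cache_path="phoneme_cache",  # presumably only used when use_phonemes=True
)
```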
Characters. I still don't fully understand the "characters" section of the config, even in a single-language context. Presumably I want to make sure all characters that correspond to spoken sounds are listed in the "characters" subsection, but does the "punctuations" subsection matter, and if so, how? What impact does it have on training/synthesis (if any), and which symbols should be included and which discarded? As for the "phonemes" subsection, do I just set it to an empty string if not using a phonemizer, or are the two settings unrelated?
Adapting this to include Spanish is fairly straightforward; it's basically the same as English, plus variants of a handful of the letters. But when you consider Japanese, things start getting very... interesting. It's a language that can use four different scripts (kanji, hiragana, katakana, Latin) within the same sentence, there are no spaces between words, different sequences of characters can have the same reading, a single sequence of characters can have multiple different readings, and the total character set numbers in the thousands. What's the proper way to define this so that the model can effectively "understand" and use it?
Intuitively, it seems like the only way to resolve this would be to manually convert all text to romaji (the Latin representation of Japanese script/sounds) and insert spaces between words in the training dataset (there's a rough sketch of what I mean at the end of this post). At first glance at the config for the only Japanese pretrained model available ("ja/kokoro/tacotron2-DDC"), this seems to be the approach taken, since its "characters" section only lists Latin characters. However, when synthesizing with this model, romaji input produces gibberish (single characters spelled out), while raw Japanese input (a mix of kanji/kana) produces proper speech, strongly suggesting the opposite: that the model was trained on raw Japanese input. This model uses "ja_jp_phonemizer" as its phonemizer, so I take it that all text-to-sound conversion is delegated to that and the entire "characters" section is ignored?

Assuming so, it's still a problem for my use case, since I can't set per-language phonemizers, and I can't use the Japanese phonemizer globally because my model also covers English and Spanish. Is there a way around this, or am I forced to train two separate models: a Japanese one and an English+Spanish one?

Also, the Japanese pretrained model uses a custom vocoder ("vocoder_models/ja/kokoro/hifigan_v1"), while the VITS model I want to train doesn't need a separate vocoder. And if I did use one, I'm again limited to a single vocoder while having multiple languages. So how do I proceed with respect to vocoders? Can I get away with simply not using one?
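For clarity, the romaji pre-processing I had in mind above would be something along these lines (a rough sketch using the third-party pykakasi library, which is not part of this repo; I don't know whether this is actually a sensible approach):

```python
# Pre-convert raw Japanese text to space-separated romaji before building the
# training metadata. pykakasi is a third-party library, not part of Coqui TTS.
import pykakasi

kks = pykakasi.kakasi()

def to_romaji(text: str) -> str:
    # convert() splits the text into segments and returns several readings per
    # segment; "hepburn" is the Latin (romaji) transliteration.
    segments = kks.convert(text)
    return " ".join(seg["hepburn"] for seg in segments if seg["hepburn"].strip())

print(to_romaji("日本語のテキストを音声に変換する"))
# roughly "nihongo no tekisuto wo onsei ni henkan suru" (segmentation may differ)
```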