Replies: 4 comments 5 replies
-
With smaller datasets, you can sometimes get better results by fine-tuning a multi-speaker model and replacing one of the speakers (I usually replace speaker 0). The arctic model is available for U.S. English, and both vctk and aru for British English.
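In practice, the replacement happens in metadata.csv: every clip of the new voice is assigned the speaker id of the slot being overwritten (speaker 0 here). A made-up example in the id|speaker|text layout that piper_train's multi-speaker preprocessing expects (file names and sentences are placeholders):

```
new_voice_0001|0|This clip takes over speaker 0 of the base model.
new_voice_0002|0|Every line from the new dataset uses the same speaker id.
```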
-
Probably something that should be added to the training README, but when training a model from scratch instead of fine-tuning, it would be nice to get an idea of how large the dataset should be for best results: 100 sentences, 1,000, 10,000? It would also be nice to know how success and performance vary with the number of epochs. Just curious; this is early research.
-
https://github.com/sweetbbak/Neural-Amy-TTS/blob/main/model_33/amy.onnx Here are my models so far. I was wondering if you all knew whether there is a benefit to training a model from scratch instead of fine-tuning. Also, is there any place where people post their models?
-
Here's the setup for the training session I'm currently attempting. I am running WSL on Windows with an Ubuntu-22.04 distribution, mostly following the official training docs and the WSL guide they link to. I only have an RTX 3060, so I expect things to be a little slow.
I am training a multi-speaker voice, fine-tuning from the high-quality lessac voice. I used 6 of the speakers from the train_clean_100 subset of the LibriTTS-R dataset, about 660 wavs in total. LibriTTS-R (http://www.openslr.org/141/) is derived from LibriTTS but cleans up the recordings significantly. The high-quality LibriTTS voice that has been released is great, but many of the speakers have significant background noise or sound like they are in a hole. I have spent I-don't-know-how-long going through the demos of all 900+ speakers in that one looking for voices I like. I'm hoping to make a voice with a much smaller number of speakers, but all ones I think are usable.

The recordings in LibriTTS-R were also not all immediately usable, as they have a sample rate of 24,000 Hz. I wrote a little script that calls the sox utility to convert the wavs to 22,050 Hz (and compiles the metadata.csv from the normalized .txt files that accompany each recording); a sketch of that kind of script is below. It's been running all night, and I'm only on epoch 147. I'm hoping to train 1,000 epochs or so and see how they sound.
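For anyone wanting to do the same, here's a rough sketch of that kind of script (directory names are assumptions; adjust for however you extracted the dataset):

```bash
#!/usr/bin/env bash
# Resample LibriTTS-R clips (24,000 Hz) to the 22,050 Hz Piper expects and
# build an ljspeech-style metadata.csv (id|speaker|text) from the
# .normalized.txt transcript that ships next to each wav.
SRC=LibriTTS_R/train-clean-100   # assumed location of the extracted subset
OUT=my-dataset                   # assumed output directory
mkdir -p "$OUT/wav"
: > "$OUT/metadata.csv"
find "$SRC" -name '*.wav' | while read -r wav; do
    id=$(basename "$wav" .wav)
    spk=${id%%_*}                             # LibriTTS ids begin with the speaker number
    sox "$wav" -r 22050 "$OUT/wav/$id.wav"    # resample 24000 -> 22050 Hz
    text=$(cat "${wav%.wav}.normalized.txt")  # transcript accompanying each clip
    printf '%s|%s|%s\n' "$id" "$spk" "$text" >> "$OUT/metadata.csv"
done
```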
-
I just trained my first model last night over the past 11 hours. I had tried to set it up once before but failed. I came back at it with a new approach using Google Colab and immediately ran out of the free credits and memory. Part of the problem is that I had NO clue what a dataset should look like and immediately jumped to the OG ljspeech dataset as a reference, which is straight up 13,000+ wav files, and nothing I had could handle that even on the lowest settings lol. Kinda dumb, but I learned more about it. I do still want to see what a dataset that big would output though.
So I moved to my secondary laptop, which runs a 1080 Ti laptop GPU, just to try to make things work. I fine-tuned a model from the Amy checkpoint, as I thought it sounds the best, using 101 wav files (huge drop there lol). It took about 11 hours to go from epoch 2160 to epoch 2999, but to my surprise it sounds remarkably good and like the original speaker! Even then, I suspect this is overkill. It's wild, because I had spent multiple days trying to reverse engineer the IV*NA Amy applications so that I could use them on Linux.
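(For reference, a fine-tuning run like this follows the pattern from Piper's TRAINING.md; the checkpoint path, dataset dir, and batch size below are placeholders, not the exact values used here.)

```bash
# --dataset-dir points at the output of piper_train.preprocess;
# --resume_from_checkpoint is the base voice being fine-tuned.
python3 -m piper_train \
    --dataset-dir ./my-dataset-preprocessed \
    --accelerator gpu \
    --devices 1 \
    --batch-size 16 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 3000 \
    --resume_from_checkpoint /path/to/amy-medium.ckpt \
    --checkpoint-epochs 1 \
    --precision 32
```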
I have a 6700 XT, and a 3060 Ti that isn't in my PC right now. I'm considering using it to train some more; I just hate switching it out. Colab is a no-go unless you're paying, imo, but at that point you may as well just rent a GPU or VM online with enough memory to blaze through the dataset.
My other idea is to rotate through other datasets of around 100-300 wav files each to see if I can get better output. I use TTS to make audiobooks, and some of the weird sounds and hallucinations get kind of annoying when you're listening for 5 hours.
What does everyone's training setup look like, and have you learned anything useful?