Replies: 4 comments 5 replies
-
With smaller datasets, you can sometimes get better results by fine-tuning a multi-speaker model and replacing one of the speakers (I usually replace speaker 0). The arctic model is available for U.S. English, and both vctk and aru for British English.
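In practice, the replacement happens in metadata.csv: every clip of the new voice is assigned the speaker id of the slot being overwritten (speaker 0 here). A made-up example in the id|speaker|text layout that piper_train's multi-speaker preprocessing expects (file names and sentences are placeholders):

```
new_voice_0001|0|This clip takes over speaker 0 of the base model.
new_voice_0002|0|Every line from the new dataset uses the same speaker id.
```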
-
Probably something that should be added to the training README, but when training a model from scratch instead of fine-tuning, it would be nice to get an idea of how large the dataset should be for best results: 100 sentences, 1,000, 10,000? It would also be nice to know how success and performance vary with the number of epochs. Just curious; this is early research.
-
https://github.com/sweetbbak/Neural-Amy-TTS/blob/main/model_33/amy.onnx Here are my models so far. I was wondering if you all knew whether there is a benefit to training a model from scratch instead of fine-tuning. Also, is there any place where people post their models?
-
Here's the setup for the training session I'm currently attempting. I am running WSL on Windows with an Ubuntu-22.04 distribution, mostly following the official training docs and the WSL guide they link to. I only have an RTX 3060, so I expect things to be a little slow.
I am training a multi-speaker voice, fine-tuning from the high-quality lessac voice. I used 6 of the speakers from the train_clean_100 subset of the LibriTTS-R dataset, about 660 wavs in total. LibriTTS-R (http://www.openslr.org/141/) is derived from LibriTTS but cleans up the recordings significantly. The high-quality LibriTTS voice that has been released is great, but many of the speakers have significant background noise or sound like they are in a hole. I have spent I-don't-know-how-long going through the demos of all 900+ speakers in that one looking for voices I like. I'm hoping to make a voice with a much smaller number of speakers, but all ones I think are usable.

The recordings in LibriTTS-R were also not all immediately usable, as they have a sample rate of 24,000 Hz. I wrote a little script that calls the sox utility to convert the wavs to 22,050 Hz (and compiles the metadata.csv from the normalized .txt files that accompany each recording); a sketch of that kind of script is below. It's been running all night, and I'm only on epoch 147. I'm hoping to train 1,000 epochs or so and see how they sound.
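For anyone wanting to do the same, here's a rough sketch of that kind of script (directory names are assumptions; adjust for however you extracted the dataset):

```bash
#!/usr/bin/env bash
# Resample LibriTTS-R clips (24,000 Hz) to the 22,050 Hz Piper expects and
# build an ljspeech-style metadata.csv (id|speaker|text) from the
# .normalized.txt transcript that ships next to each wav.
SRC=LibriTTS_R/train-clean-100   # assumed location of the extracted subset
OUT=my-dataset                   # assumed output directory
mkdir -p "$OUT/wav"
: > "$OUT/metadata.csv"
find "$SRC" -name '*.wav' | while read -r wav; do
    id=$(basename "$wav" .wav)
    spk=${id%%_*}                             # LibriTTS ids begin with the speaker number
    sox "$wav" -r 22050 "$OUT/wav/$id.wav"    # resample 24000 -> 22050 Hz
    text=$(cat "${wav%.wav}.normalized.txt")  # transcript accompanying each clip
    printf '%s|%s|%s\n' "$id" "$spk" "$text" >> "$OUT/metadata.csv"
done
```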
-
I just trained my first model last night over the past 11 hours. I had tried to set it up once before but failed. I came back at it with a new approach using Google Colab and immediately ran out of the free credits and memory. Part of the problem is that I had NO clue what a dataset should look like and immediately jumped to the OG ljspeech dataset as a reference, which is straight up 13,000+ wav files, and nothing I had could handle that even on the lowest settings lol. Kinda dumb, but I learned more about it. I do still want to see what a dataset that big would output though.
So I moved to my secondary laptop, which runs a 1080 Ti laptop GPU, just to try to make things work. I fine-tuned a model from the Amy checkpoint, as I thought it sounds the best, using 101 wav files (huge drop there lol). It took about 11 hours to go from epoch 2160 to epoch 2999, but to my surprise it sounds remarkably good and like the original speaker! Even then, I suspect this is overkill. It's wild, because I had spent multiple days trying to reverse engineer the IV*NA Amy applications so that I could use them on Linux.
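(For reference, a fine-tuning run like this follows the pattern from Piper's TRAINING.md; the checkpoint path, dataset dir, and batch size below are placeholders, not the exact values used here.)

```bash
# --dataset-dir points at the output of piper_train.preprocess;
# --resume_from_checkpoint is the base voice being fine-tuned.
python3 -m piper_train \
    --dataset-dir ./my-dataset-preprocessed \
    --accelerator gpu \
    --devices 1 \
    --batch-size 16 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 3000 \
    --resume_from_checkpoint /path/to/amy-medium.ckpt \
    --checkpoint-epochs 1 \
    --precision 32
```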
I have a 6700 XT, and a 3060 Ti that isn't in my PC right now. I'm considering using it to train some more; I just hate switching it out. Colab is a no-go unless you're paying, imo, but at that point you may as well just rent a GPU or VM online with enough memory to blaze through the dataset.
My other idea is to rotate through other datasets of around 100-300 wav files each to see if I can get better output. I use TTS to make audiobooks, and some of the weird sounds and hallucinations get kind of annoying when you're listening for 5 hours.
What does everyone's training setup look like, and have you learned anything useful?