Ability to Read Longer Audio (i.e., Audiobooks) #83
Replies: 11 comments
-
I think so far it’s not very good at narrating an entire audiobook, because the training data doesn’t contain entire audiobooks. The training data consists purely of independent clips taken from amateur audiobook readings. It won’t be like ElevenLabs, which is trained on professional audiobook datasets, as that kind of data is usually not in the public domain. However, if we did have the data, training could easily be changed to use it, by conditioning on the previous style when sampling the current style. This would probably reproduce the effect of ElevenLabs, especially for dialogue. The closest dataset to an entire audiobook is LJSpeech, but again it’s completely non-fiction, so it won’t be good for fiction reading (no dialogue), and it might produce unnatural intonation because each clip was treated independently during training.
-
Hmm. Thanks. LibriVox seems like a good place to get public domain audiobooks. Are there any plans to add this capability in the future?
-
LibriTTS is already taken from LibriVox, but for some reason it isn’t made up of complete audiobook narrations, just very fragmented clips taken from them. I don’t know why they removed so many clips.
-
I feel like the quality would be lower if you trained it on an entire audiobook, right? I don't know, I guess it just feels like the longer the samples are, the worse it will be (I might be wrong). Maybe we can use Tortoise TTS's splitting script with this? However, if it's possible to train a TTS model on long text without degrading quality, it shouldn't be too hard to write a script to scrape LibriVox by reader (they have an API). I was able to make this dataset a while back using their API, but I didn't include readers at that time.
-
No, we do have to train on audio clips, but the idea is that we condition the current style sampling on the previous text and style, so it will be more continuous and possibly also handle dialogue better (if the audio clips are split according to dialogue). It won’t work if we train on entire, unsplit recordings because we don’t have enough RAM.
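As a rough picture of what "conditioning the current style on the previous style" could look like, here is a minimal sketch that blends each newly sampled style vector with the style used for the previous clip. This is only an inference-time stand-in, not the training-time conditioning discussed in this thread. The names `blend_style`, `styles_for_segments`, and `carry` are hypothetical, and style vectors are assumed to be plain lists of floats.

```python
def blend_style(previous_style, sampled_style, carry=0.7):
    """Return a convex combination of the previous and the newly sampled style.

    carry=0.0 ignores the previous clip; carry=1.0 reuses its style unchanged.
    Hypothetical helper for illustration, not part of any TTS library.
    """
    if not 0.0 <= carry <= 1.0:
        raise ValueError("carry must be between 0 and 1")
    if len(previous_style) != len(sampled_style):
        raise ValueError("style vectors must have the same length")
    return [carry * p + (1.0 - carry) * s
            for p, s in zip(previous_style, sampled_style)]


def styles_for_segments(sampled_styles, carry=0.7):
    """Chain the blend across consecutive clips so the style drifts smoothly."""
    styles = []
    previous = None
    for sampled in sampled_styles:
        # The first clip has no predecessor, so it keeps its sampled style.
        current = list(sampled) if previous is None else blend_style(previous, sampled, carry)
        styles.append(current)
        previous = current
    return styles
```

A higher `carry` keeps narration more uniform across clips; a lower one lets each clip's own sampled style dominate, which may suit dialogue better.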
-
Hmm, interesting! Are you planning to implement something like this in the future?
-
Yeah, probably, but I don't think it'll be that simple. If the effort is more than trivial concatenation, it could be a different project or paper, but for now the difference probably won't be big enough on the LibriTTS dataset because there is no dialogue. It would be more useful if we could get some fiction audiobook datasets that are separated by character.
-
Hmm. Hypothetically, if there were a long audiobook dataset available, how difficult do you think it would be to implement?
-
I implemented a basic long-text reader on the online demo by splitting text, but it isn't perfect yet. (Update: I removed it because someone said it made it harder to clone with Docker.)
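For anyone who wants to try the splitting approach locally, here is a minimal sketch of one way to do it. It is not the code from the demo. The function name `split_into_chunks` and the `max_chars` limit are assumptions, and the regex only handles simple `.`, `!`, `?` sentence endings.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Split text at sentence boundaries and pack sentences into chunks.

    Each chunk holds as many whole sentences as fit in max_chars. A single
    sentence longer than max_chars is kept whole as its own chunk.
    """
    # Break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio concatenated.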
-
I am fine with removing the long-text option, because I think that it should be a default setting in every task.
-
The problem I had with long text and splitting by sentences is that occasionally, and only with short sentences (fewer than 40 characters), I got very loud white noise, sometimes minutes long for a single sentence. I dealt with this by combining short sentences with the sentences around them, to make sure none of the text blocks given to styletts2 was short (see #46). If that were taken care of, it would be much easier to deal with long-form text.
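The workaround described above can be sketched roughly as follows. This is not the code from #46, only an illustration of the idea. The function name `merge_short_sentences` is hypothetical, and the default `min_chars=40` simply mirrors the 40-character threshold mentioned above.

```python
def merge_short_sentences(sentences, min_chars=40):
    """Merge sentences so that no text block is shorter than min_chars.

    A block that is still too short absorbs the next sentence. A short block
    at the very end is folded into the block before it. If the whole input is
    shorter than min_chars, a single short block is returned unchanged.
    """
    merged = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        if merged and len(merged[-1]) < min_chars:
            # The previous block is still too short, so extend it.
            merged[-1] = f"{merged[-1]} {sentence}"
        else:
            merged.append(sentence)
    # A short trailing block has no successor, so attach it to its predecessor.
    if len(merged) > 1 and len(merged[-1]) < min_chars:
        last = merged.pop()
        merged[-1] = f"{merged[-1]} {last}"
    return merged
```

For example, `["Yes.", "No.", <a long sentence>, "Ok."]` comes back as a single block, since the two short openers are prepended to the long sentence and the short closer is appended to it.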
-
Hi,
Might it be possible to implement a tqdm progress bar for longer text? This would make it possible to easily narrate entire audiobooks! Thanks!
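In the meantime, a progress bar can be wrapped around your own synthesis loop. A minimal sketch, assuming you already have a list of text chunks and some function that turns one chunk into audio. `narrate` and `synthesize` are hypothetical names here, not part of any library.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fall back to a plain iterator if tqdm is not installed.
    def tqdm(iterable, **kwargs):
        return iterable


def narrate(chunks, synthesize):
    """Synthesize each text chunk in order and return the audio pieces.

    `synthesize` is any callable that maps one text chunk to audio; it is a
    placeholder for whatever TTS call you use.
    """
    audio_pieces = []
    for chunk in tqdm(chunks, desc="Synthesizing", unit="chunk"):
        audio_pieces.append(synthesize(chunk))
    return audio_pieces
```

Because `chunks` is a list, tqdm can show the total count and an estimated time remaining.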