Hi, reading about ControlNet, I wonder whether it would be suitable for a text-to-speech task.
Instead of general images, the network would output mel-spectrograms: the input to the locked copy would be the text to synthesize, and the input to the trainable copy could be, for example, a suitable representation of a reference speech (though other possibilities are worth considering).
I was wondering whether anyone has thought about this and believes it could work. I think it would work best with a ControlNet-like approach applied to diffusion models already trained to generate mel-spectrograms.
This is just a high-level idea (sketched below), but it could be elaborated further if it sparks any creative ideas on your end.
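
To make the wiring concrete, here is a minimal, hypothetical PyTorch sketch of the idea: a frozen ("locked") mel-spectrogram denoiser, a trainable copy of its encoder fed with the reference-speech signal, and zero-initialized 1x1 convolutions joining the two, which is the key ControlNet trick. Everything here is invented for illustration: `MelDenoiser` and `ControlNetTTS` are made-up names, the denoiser is a toy stand-in rather than a real pretrained model, and the text/timestep conditioning a real TTS denoiser needs is omitted.

```python
import copy

import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero (the ControlNet trick), so the
    control branch contributes nothing before training starts."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class MelDenoiser(nn.Module):
    """Toy stand-in for a pretrained diffusion denoiser over mel-spectrograms,
    treated as 1-channel images of shape (n_mels, frames). The text and
    timestep conditioning a real TTS denoiser needs is omitted for brevity."""

    def __init__(self, channels=(32, 64)):
        super().__init__()
        blocks, in_ch = [], 1
        for ch in channels:
            blocks.append(nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1),
                                        nn.SiLU()))
            in_ch = ch
        self.encoder = nn.ModuleList(blocks)
        self.decoder = nn.Conv2d(in_ch, 1, 3, padding=1)

    def forward(self, x, residuals=None):
        # `residuals` are optional per-block features from the control branch.
        for i, block in enumerate(self.encoder):
            x = block(x)
            if residuals is not None:
                x = x + residuals[i]
        return self.decoder(x)


class ControlNetTTS(nn.Module):
    """Frozen ("locked") pretrained denoiser plus a trainable copy of its
    encoder, joined by zero convolutions."""

    def __init__(self, pretrained: MelDenoiser, channels=(32, 64)):
        super().__init__()
        # Copy the encoder *before* freezing, so the copy stays trainable.
        self.control = copy.deepcopy(pretrained.encoder)
        self.locked = pretrained
        for p in self.locked.parameters():
            p.requires_grad_(False)  # only the control branch is trained
        self.zero_convs = nn.ModuleList(zero_conv(ch) for ch in channels)

    def forward(self, noisy_mel, hint):
        # Run the trainable copy on the control signal (here assumed to share
        # the mel-spectrogram shape) and collect one residual per block.
        h, residuals = hint, []
        for block, zconv in zip(self.control, self.zero_convs):
            h = block(h)
            residuals.append(zconv(h))
        # The locked copy denoises with the control residuals injected.
        return self.locked(noisy_mel, residuals=residuals)


pretrained = MelDenoiser()              # in reality: load frozen TTS weights
model = ControlNetTTS(pretrained)
noisy_mel = torch.randn(2, 1, 80, 256)  # (batch, 1, n_mels, frames)
ref = torch.randn(2, 1, 80, 256)        # reference speech, same representation
print(model(noisy_mel, ref).shape)      # torch.Size([2, 1, 80, 256])
```

Because the zero convolutions start at zero, the combined model initially behaves exactly like the frozen pretrained denoiser, and the reference-speech conditioning is learned gradually without disturbing what the locked copy already knows.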
Replies: 1 comment 1 reply

Hi @Brugio96! I have actually been working a little on a ControlNet for audio, specifically for audio inpainting. It's not the text-to-speech task directly, but it applies a similar idea: a ControlNet on top of a model trained to generate mel-spectrograms. Check it out if you're curious or want to use it as inspiration! Code here --> https://github.com/zachary-shah/riff-cnet (sorry, my repo is a little messy right now >.>)
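
Both posts assume a model operating on mel-spectrograms. As for what a "suitable representation of a reference speech" might be, one simple choice is a log-mel spectrogram of the reference clip, matching the representation the model generates. A short sketch using librosa, with illustrative file name and parameter values:

```python
import librosa
import numpy as np

# Load a reference clip (path and sample rate are illustrative).
y, sr = librosa.load("reference_speech.wav", sr=22050)

# 80-band mel-spectrogram, then log-compress to decibels.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (80, n_frames)
```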