Hi, reading about ControlNet, I wonder whether it would be suitable for a text-to-speech task.
Instead of general images, the network would output mel-spectrograms: the input to the locked copy would be the text to synthesize, and the input to the trainable copy could be, for example, a suitable representation of a reference speech (though other possibilities are worth considering).
I was wondering whether anyone has thought about this and believes it could work. I think it would work best with a ControlNet-like approach applied to diffusion models already trained to generate mel-spectrograms.
This is just a high-level idea (sketched below), but it could be elaborated further if it sparks any creative ideas on your end.
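
To make the wiring concrete, here is a minimal, hypothetical PyTorch sketch of the idea: a frozen ("locked") mel-spectrogram denoiser, a trainable copy of its encoder fed with the reference-speech signal, and zero-initialized 1x1 convolutions joining the two, which is the key ControlNet trick. Everything here is invented for illustration: `MelDenoiser` and `ControlNetTTS` are made-up names, the denoiser is a toy stand-in rather than a real pretrained model, and the text/timestep conditioning a real TTS denoiser needs is omitted.

```python
import copy

import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero (the ControlNet trick), so the
    control branch contributes nothing before training starts."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class MelDenoiser(nn.Module):
    """Toy stand-in for a pretrained diffusion denoiser over mel-spectrograms,
    treated as 1-channel images of shape (n_mels, frames). The text and
    timestep conditioning a real TTS denoiser needs is omitted for brevity."""

    def __init__(self, channels=(32, 64)):
        super().__init__()
        blocks, in_ch = [], 1
        for ch in channels:
            blocks.append(nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1),
                                        nn.SiLU()))
            in_ch = ch
        self.encoder = nn.ModuleList(blocks)
        self.decoder = nn.Conv2d(in_ch, 1, 3, padding=1)

    def forward(self, x, residuals=None):
        # `residuals` are optional per-block features from the control branch.
        for i, block in enumerate(self.encoder):
            x = block(x)
            if residuals is not None:
                x = x + residuals[i]
        return self.decoder(x)


class ControlNetTTS(nn.Module):
    """Frozen ("locked") pretrained denoiser plus a trainable copy of its
    encoder, joined by zero convolutions."""

    def __init__(self, pretrained: MelDenoiser, channels=(32, 64)):
        super().__init__()
        # Copy the encoder *before* freezing, so the copy stays trainable.
        self.control = copy.deepcopy(pretrained.encoder)
        self.locked = pretrained
        for p in self.locked.parameters():
            p.requires_grad_(False)  # only the control branch is trained
        self.zero_convs = nn.ModuleList(zero_conv(ch) for ch in channels)

    def forward(self, noisy_mel, hint):
        # Run the trainable copy on the control signal (here assumed to share
        # the mel-spectrogram shape) and collect one residual per block.
        h, residuals = hint, []
        for block, zconv in zip(self.control, self.zero_convs):
            h = block(h)
            residuals.append(zconv(h))
        # The locked copy denoises with the control residuals injected.
        return self.locked(noisy_mel, residuals=residuals)


pretrained = MelDenoiser()              # in reality: load frozen TTS weights
model = ControlNetTTS(pretrained)
noisy_mel = torch.randn(2, 1, 80, 256)  # (batch, 1, n_mels, frames)
ref = torch.randn(2, 1, 80, 256)        # reference speech, same representation
print(model(noisy_mel, ref).shape)      # torch.Size([2, 1, 80, 256])
```

Because the zero convolutions start at zero, the combined model initially behaves exactly like the frozen pretrained denoiser, and the reference-speech conditioning is learned gradually without disturbing what the locked copy already knows.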
Replies: 1 comment 1 reply

Hi @Brugio96! I have actually been working a little on a ControlNet for audio, specifically for audio inpainting. It's not the text-to-speech task directly, but it applies a similar idea: a ControlNet on top of a model trained to generate mel-spectrograms. Check it out if you're curious or want to use it as inspiration! Code here --> https://github.com/zachary-shah/riff-cnet (sorry, my repo is a little messy right now >.>)
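
Both posts assume a model operating on mel-spectrograms. As for what a "suitable representation of a reference speech" might be, one simple choice is a log-mel spectrogram of the reference clip, matching the representation the model generates. A short sketch using librosa, with illustrative file name and parameter values:

```python
import librosa
import numpy as np

# Load a reference clip (path and sample rate are illustrative).
y, sr = librosa.load("reference_speech.wav", sr=22050)

# 80-band mel-spectrogram, then log-compress to decibels.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (80, n_frames)
```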