[Double Control] What model is most needed? #30

lllyasviel · 2023-02-13T01:54:22Z

lllyasviel
Feb 13, 2023
Maintainer

We plan to train some models with "double controls", use two concat control maps and we are considering using images with holes as the second control map. This will lead to some model like "depth-aware inpainting" or "canny-edge-aware inpainting". Please also let us know if you have good suggestions.

Jaxkr · 2023-02-13T03:52:38Z

Jaxkr
Feb 13, 2023

I am very interested in depth-aware inpainting, which I also mentioned in #24. Additionally, img2img with additional pose/depth control would be amazing!

A big use-case for this would be for rotoscoped animations with (potentially) temporal coherence. By using the previous frame (with background removed) as the input to the next and having the additional pose control layer, we may be able to change the fewest pixels possible.

Previously this has been impossible and all Stable Diffusion animations have their trademark flicker and lack of coherence. I think "double controls" may allow us to crack that nut!

This was accomplished by adding controls from the pose model to an anime model, running each frame through a background remover, and a little prompt engineering:

Screenshare.-.2023-02-13.2_45_57.AM.mp4

Great coherence on outfit, but terribly flickery! Something to fix that flicker would be incredible.

8 replies

Cahiliye Feb 15, 2023

Lütfen biri artık bana tepki versin lütfen yalvarırım

janekm Feb 19, 2023

Alternatively I think previous frame + optical flow would be really interesting to try (as suggested in one of the issues)!

Filarius Feb 22, 2023

I'm on per-step latent motion blur. Just simple blending. It do some job for coherence, but bad on fast motion.
Now tryting to undertand how to use dense optical flow for this case.

EriIaz Feb 22, 2023

I'm on per-step latent motion blur. Just simple blending. It do some job for coherence, but bad on fast motion. Now trying to understand how to use dense optical flow for this case.

I am a photographer, so maybe I am biased towards my single shot methods, but I think the traditional photo and video editing methods are better at that, assuming you can select your subject directly or use a slightly edited depth map to separate the subject from your background. It's easier to directly manipulate pixels for that instead of invoking motion blur from noise with a diffusion model. A good traditional motion blur filter can give you precise controls with exact motion path of the blur, which can be a simple linear or curve vector, a complex path of shaking hands, it can even be radial blur, or combination of all the above. Movement can be linear, movement can be rotational, and that greatly affects the properties of your blur: if the camera moves with the moving subject, it will be the close stationary objects blurred the most, but distant objects won't be blurred as much because their angular velocity will be negligible. If the camera is stationary and only follows the subjects by turning towards it, we have rotation, we'll have consistent angular velocity for the entire background. It can also have it inconsistent with the background elements moving in different directions: one football player with the ball is in focus and runs fast, the football field is stationary and is blurred in one direction because camera follows the player, and some other player at the background is blurred in his own way because he is moving too. Or another example: a car with a camera follows horsemen riding fast on a tournament. All the riders will be relatively unaffected by motion blur, even if we focus on the current leader, close background and foreground will be blurred into oblivion, unless we are talking car's interior which is also stationary for the camera, and distant background will be blurred only a little bit, if at all, because it's distant and the angular velocity of it is very low. Extremely distant objects like sun, moon or stars will appear stationary in that case.
I guess may be possible to emulate that with a specific ControlNet, but it will be difficult to make a good blur map as a simplification of the desired blur pattern, then select the subject or apply it to a depth map, edited to represent the angular velocity of that blur on captured objects, and get a consistent results with it. We are too limited by relatively simple gradio webUI for that. And if we get controlnet as a PS/AE/whatever editor filter, it still will be easier to do that the old way unless we would be able to "talk" to the model with a prompt and it will take it very specifically, instructing the model to blur the exact area with exact blur pattern and being able to get a consistent results, both for single shot and for a video. Especially video, because inconsistent result be have atrocious temporal coherency.

TOKYOJAB Mar 18, 2023

It certainly is possible. Here is a grid of key frames I created using only stable diffusion. Even when overriding the underlying video, consistency is maintained, not just stylizing it. I've been posting quick videos of my progress over the last few months.. Have a look at the last few... https://www.youtube.com/@THEJABTHEJAB

Steampunk1080.mp4

LEP.NEW2b.mp4

dqueue · 2023-02-14T05:31:10Z

dqueue
Feb 14, 2023

I think canny-edge-aware inpainting would be very useful for better shape control with inpainting.

1 reply

imperator-maximus Feb 18, 2023

scribble/sketch based like in T2I-Adapter would be also great to get better inpainting results: https://github.com/TencentARC/T2I-Adapter

pd-michaelstanley · 2023-02-14T06:25:37Z

pd-michaelstanley
Feb 14, 2023

Great idea - I would be very interested to see any combinations of depth, semantic segmentation, and surface normals. I'm very interested to see how the model handles trade-offs between them, or if it can still generate diverse images. Also, great work!

3 replies

Njasa2k Feb 18, 2023

semantic segmentation

I'm not sure what semantic segmentation does, does it apply a specific prompt to each of the segmentation regions?

toyxyz Feb 18, 2023

If depth and pose can be used together, you can create the background and character at once!

AMEERAZAM08 Feb 28, 2023

how you did this one ?

Cahiliye · 2023-02-15T20:23:52Z

Cahiliye
Feb 15, 2023

Sizleri rahatsız edici bir durum söz konusu değildir bu platformu bilmiyorum fakat benim özel hayatımda ciddi etki yarattı bir bilen benim bir kaç sorumu yanıtlarmi rica ediyorum

0 replies

renrenzsbbb · 2023-02-21T08:58:22Z

renrenzsbbb
Feb 21, 2023

awesome！I look forward to "depth-aware inpainting" or "canny-edge-aware inpainting". I think these will be very useful for build 3d texture map for a pre-build mesh.

1 reply

ghpkishore Mar 2, 2023

https://github.com/haofanwang/ControlNet-for-Diffusers

See this repo. It is possible to do it using it

EriIaz · 2023-02-21T10:17:41Z

EriIaz
Feb 21, 2023

I think we also need inpainting model support in general, if possible. They don't work with ControlNet right now, but these models' ability to recognise the surroundings of masked area to generate seamless output is very useful. The problem is, they generate whatever they see fit, and if you inpaint a complicated image, you can only prompt it.
But with ControlNet, it will be possible to inpaint a person in the right pose, a building of the right shape, it really puts inpainting to the next level.
Now imagine that function in a Photoshop plugin, fed with layers... This really would revolutionize photo editing as we know it

3 replies

imperator-maximus Feb 21, 2023

as a developer of a Photoshop plugin I fully agree. The standard 1.5 / 2.x Inpainting also sometimes fails and does (perfect) background restoration instead. Even with every trick like negative prompt prompt weights does not help in some situations. So having a guidance like doing a sketch in inpainting area would be perfect. T2I-Adapter claims having it and both projects are about to integrate in Diffuser library. But as far I see ControlNet is getting much more attention.

ghpkishore Mar 2, 2023

You can also try paint by example. It does a better job inpainting.

imperator-maximus Mar 2, 2023

what you mean with paint by example? Maybe I missed something which is pretty likely these days 🙂

EriIaz · 2023-02-21T15:53:19Z

EriIaz
Feb 21, 2023

Is it possible to make a controlnet for colors?
I know we can use prompts, but prompt doesn't recognise the exact colors very well. ControlNet also can be used in img2img mode with another image or sketch layer as a guide for colour and lighting with relative ease, but you need splashes in the right places, and that's one way of doing that, I believe there may be other ways of doing that as well.
Is it possible to make a model which recognises colours of a given picture, turns it into a lut-alike map with preprocessor, and is capable of manipulating a diffusion model to generate an image in accordance with it, while also following the prompt for things like contrast, style and whatnot? So basically, you can either have txt2img generation with predefined colors as accurate as you weight/guidance allows it OR you can get a lot of colour correction options with img2img even at relatively low denoising strength. Can also be used for color grading a video with relative ease.

1 reply

ghpkishore Mar 2, 2023

There is a repo called composer which is made by Alibaba. The code isn’t out yet, but they have this in them. You can see the project for more details. https://github.com/damo-vilab/composer

nikitastaf1996 · 2023-02-21T16:10:59Z

nikitastaf1996
Feb 21, 2023

I was thinking about instruct pix2pix version of Controlnet.And of course waiting for 2.1 versions.

0 replies

znhacker · 2023-02-21T16:26:28Z

znhacker
Feb 21, 2023

除了目前的轮廓控制外，还希望有额外的颜色标注，图片内对象内容的多颗粒度标注（比如这个是一个瓶子，这个是一只狗，这是一个人的头部、表情微笑），景深标记（这个位于顶层，这个位于底层，这个位于第x层），物体对象的姿势骨骼标注

我觉得通过构图轮廓、颜色标记、内容标记（多颗粒度多参数的）、景深标记，物体的姿势骨骼标记这些弄下来，基本就把扩散模型驯化成完全可控的了吧😂 感觉已经很完整了，想不出额外的了

0 replies

janekm · 2023-02-22T16:22:24Z

janekm
Feb 22, 2023

Another idea, would it be possible to add an additional input for a clip image embedding? That way potentially a similar feature to Midjourney's image prompts could be achieved...

0 replies

chadboyda · 2023-02-23T00:55:40Z

chadboyda
Feb 23, 2023

Would love to see a model that controls luminosity. To be able to take an image and apply its luminosity and/or tone to a new diffusion.

0 replies

Dekker3D · 2023-02-27T08:19:48Z

Dekker3D
Feb 27, 2023

A plain inpainting model, capable of turning any fine-tuned version of SD into something that works like 1.5-inpainting, would be amazing.

I would also like to see a model that accepts an image and a number indicating a time offset or something like that, to allow some form of interpolation between previous and future frames in a video. People are already applying multiple ControlNets to a single generated image, so it could be nice if you could just stack multiple instances of this net with different frames.

The same as above could also probably be reused for generating another view of a building/object from another angle, as this would just be equivalent to the camera moving over that time offset. If this could be used to add consistency to Stable Diffusion, it would allow some great use cases for 3D art too.

1 reply

ghpkishore Mar 2, 2023

Hey , there is a link https://www.reddit.com/r/StableDiffusion/comments/zyi24j/how_to_turn_any_model_into_an_inpainting_model/

In this it is mentioned how you can convert any model to an inpainting model. You can use this and try if it works for you.

vivkul · 2023-03-01T05:32:32Z

vivkul
Mar 1, 2023

I am sorry if I don't understand, but isn't inpainting + scribbles (Fig 16) already double control? Thanks!

0 replies

floboc · 2023-03-01T19:08:35Z

floboc
Mar 1, 2023

Inpainting + Depth is definitely a must for me :)

0 replies

mhh0318 · 2023-03-02T08:10:50Z

mhh0318
Mar 2, 2023

actually add up two control net simply could realize the DuoControl effect. Here's an example:

the jupyter nb is provided here:
https://github.com/mhh0318/ControlNetDuo/blob/master/test.ipynb

2 replies

mhh0318 Mar 2, 2023

Btw, the scale factor between 2 modalities in the notebook will heavily influence the samples. Sometimes the model will ignore one of them and only focus on the other even two factor value are the same.
(I only test the NPT situation, I guess if we assign suitable prompt to the model, the ignored modality would come back again.)

Another interesting thing is that with a higher factor, in another word, a largers control features, the output img will be in gray style instead of RGB. It would caused by the larger mean of latents, as we always multiply the latents with 0.18125 after 1st vae stage.

My current repo only hacked cldm/cldm.py and ddim.py. If needed I would make a mergeable Multi-Modality version for ControlNet repo.

ghpkishore Mar 2, 2023

There is one in control net for diffusers in one of the issues. You can use that to get the multi control net working

haofanwang/ControlNet-for-Diffusers#18

joshuawilde · 2023-03-04T05:41:12Z

joshuawilde
Mar 4, 2023

@lllyasviel do you have an approximate ETA? This is not a demand at all, just a polite ask for when you think you'll have a new model on this, specifically depth + inpainting.

Even just a rough estimate would be nice to know more or less how many weeks.

Thanks!

1 reply

Njasa2k Mar 11, 2023

Maybe it takes about 400 GPU hours to train a model so probably a handful of weeks?

marcsyp · 2023-03-11T04:27:46Z

marcsyp
Mar 11, 2023

@lllyasviel

I'm not sure I understand the technical aspect of the question fully, but here are a few things that may trigger some ideas:

The ability to mask any of the controlnets using a secondary image would be amazing for instance, using a foreground subject mask to use a normal map and depth map and the inverted mask to use scribble for the background. (With "invert colors" checked, you should be able to use the same mask image for both purposes.) I believe this is what you are already considering.
One of the temporal coherence issues that I see having a big impact flicker on image sequences is drastic lighting changes -- if you could train a "lightmap" model that would extract a lightmap from an image and apply it via controlnet, this would have a huge benefit for temporal coherence as well as for inpainting and consistency across image sets generated from preexisting scenes.
Another thing I've noticed particularly with motion sequences is that often consecutive frames will generate textures that do not move in space even though the object may have moved (concrete ground surface, for instance). Some way to apply a texture map as a secondary control would be amazing, but I'm not really sure that this is what you're asking here.
The idea of using a gradient grayscale image as a secondary control of your controlnet would be amazing. Imagine you could use a vertical gradient that has a canny controlnet of high strength with a vertical gradient of white to black, and a canny controlnet of low strength with black to white (opposite direction, inverted colors), you could essentially create an image where the lower portion is highly controlled by canny that disintegrates into chaos as it moves up vertically. Of course creating complex grayscales for every controlnet you use could create some VERY interesting effects.

0 replies

VV-A-VV · 2023-08-15T16:52:22Z

VV-A-VV
Aug 15, 2023

Hi, is any "double controlnet training" code released now? I am trying to train the controlnet to disentangle some attributes in the image and control them while training a single controlnet cannot disentangle some attributes, since one attribute may be based on another attribute, so training two separate controlnet is not a good idea. I think it is neccesary to train "double controlnet". If I want to achieve this function, the change in code is that we concatenate the second controlnet latent into the original one ?

0 replies

geroldmeisinger · 2023-09-19T07:45:42Z

geroldmeisinger
Sep 19, 2023

I just stumpled upon this "Color-Canny ControlNet" I'd like to share. It was trained on images with canny and color mushed together.

One could argue: "what's the point? we can use multi-controlnet for this" and I agree but I think it's a good example for further discussion.

I'm not sure I understand the technical aspect of the question fully

Me neither. To clarify I want to outline different "terms":

Multi-controlnet inference: we can already mix multiple control nets in a serial pipeline to guide the generation
Mixed-image training: see example above. just mix the images into one RGB image and let control net figure it out
Multi-channel training: we can already activate any additional alpha channels and use RGB + Alpha1-3 (another RGB image) I think? see adding more "hint" to training process #271
Multi-control training ("double control"): using two different conditioning images in parallel? would that need an adaption of the architecture?

but I quite don't understand the difference between "multi-channel" and "multi-control". or is this really just about "yet another clever way of preparing a dataset to get a more specialized control net"?

Related work:

Composer: Creative and Controllable Image Synthesis with Composable Conditions (additional control nets: shape (depth? segment?), semantics (depth?), masking (inpaint?), style (reference?), content, intensity (luminance?), palette (color?)
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild (additional control nets: bounding box, outpainting) - train a model on all control nets at once to even make it generalize on new tasks (zero-shot) like gray image colorization, image deblurring, and image inpainting
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models (additional control nets: content)
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

1 reply

RaccoonDML Aug 27, 2024

Interesting topic! I'm trying to train a controlnet model with color and line control, but the results show that multi-channel training isn't working as expected. I believe this repo (https://github.com/jinxixiang/color_controlnet) provides an excellent implementation of multi-control training. If anyone can offer any insights or resources related to this problem, I would be incredibly grateful.

geroldmeisinger · 2023-09-19T08:12:44Z

geroldmeisinger
Sep 19, 2023

Additional tabular control for Brain-to-idea human models

A permutation of every control net I know of which could help to come up with further ideas. Now we just have to fill out the boxes. Let the first row be the "xyz-aware" part, e.g. "Depth-aware ... Inpainting":

	Canny	Depth	HED	M-LSD	Normal	Pose	Scribble	Segmentation	Inpaint	Lineart	Anime	Shuffle	Tile	Reference	Instruct	Color	Luminance	QR	Colorize	Sketch Colorize	Mysee	Temporal	Style	Content
Canny
Depth
HED
M-LSD
Normal
Pose
Scribble
Segmentation
Lineart
Anime
Shuffle
Inpaint
Tile
Reference
Instruct
Color
Luminance
QR
Colorize
Sketch Colorize
Mysee
Temporal
Style
Content

TODO:

de-duplicate: canny+HUE is basically the same. lineart+anime+scribble is basically the same
control image for control net? grayscale image which defines guidance-start, guidance-end, strength, strength at stepN etc.
add mask?
commutativity: is depth-aware inpaint the same as inpaint-aware depth?

2 replies

insomniaaac Apr 15, 2024

By April 2024, has anyone completed this table?

jiachen0212 Jul 12, 2024

i want: depth+segment~~~ hhh~

geroldmeisinger · 2023-10-17T09:41:39Z

geroldmeisinger
Oct 17, 2023

also see huggingface/diffusers#5406 (comment) (canny + inpaint)

0 replies

EmmyJ-0 · 2023-12-05T06:49:26Z

EmmyJ-0
Dec 5, 2023

For anyone who might be looking at this in the future. I looked into training a ControlNet model along side a second ControlNet model and I found that it was just much slower to train without a much noticeable difference in the end model. Perhaps it could be worth training models separately and then fine-tuning together, but I haven't tried it.

I also attempted to train ControlNet for something like TryOnDiffusion but found that the architecture for ControlNet just isn't well suited for that kind of task and it seems to be more suited for learning structural features and pixelwise comparisons.

It's worth noting I was training models on a single 4090, so I didn't push training to it's absolute limits, but once it looked like it wasn't really learning after a day or so, I gave up. I've had some better initial luck training IP Adapters for models that are more focused on semantic meaning rather than pixelwise comparison. I hope to get to the point where I can implement/train models directly from papers for Stable Diffusion but I'm not there yet.

Before looking into IP Adapters I briefly looked into training a single ControlNet where instead of a text prompt for cross attention I used an image embedding, but my initial tests were unsuccessful. I might revisit this, since I think there might be some use for that and my hunch is that I just had the wrong implementation the first time.

1 reply

geroldmeisinger Dec 5, 2023

thanks for your valuable report

geroldmeisinger · 2024-07-12T05:02:57Z

https://huggingface.co/xinsir/controlnet-union-sdxl-1.0 combines all in one controlnet: Openpose, Depth, Canny, Lineart, AnimeLineart, MLSD, Scribble, HED lines, Pidi lines (Softedge) , TEED lines, Segment, Normal maps

Would you consider open-sourcing the training code for multiple control sources?

geroldmeisinger Jul 15, 2024

I'm not Xinsir but you can ask, I got response too. It seems the training script is already on the todo list.
https://github.com/xinsir6/ControlNetPlus/blob/main/README.md#todo
https://huggingface.co/xinsir/controlnet-union-sdxl-1.0/discussions/13

[Double Control] What model is most needed? #30

lllyasviel Feb 13, 2023 Maintainer

Replies: 23 comments · 28 replies

lllyasviel
Feb 13, 2023
Maintainer

Replies: 23 comments 28 replies