This question sounds simple, and on some level I do feel like it should generalize, but I don't feel it deeply in my bones, and that annoys me. Even OpenPose itself can't understand anime images, yet now you can synthesize completely new anime images in arbitrary poses and use them to train OpenPose itself. So the capability of the trained ControlNet far surpasses OpenPose itself. What is the main ingredient that makes it generalize beyond the domain it was trained on?
I guess the 2nd one is not a requirement for generalization, because #188 already tried training smaller ControlNet models from scratch, without basing them on the original UNet, and it still works (I'm not fully convinced though; someone needs to train an OpenPose ControlNet-Lite and try it on anime images). But it's still not clear why 1. makes it generalize. Note that the ControlNet doesn't need to see any anime images at all during its training phase. So how is it possible that it learns to deal with anime images anyway? How does the model learn that human pose and anime character pose are the same concept? Why does it need to learn that at all? Why doesn't it cheat and learn only human pose? When fed anime prompts or anime noisy latents, why doesn't it just give you garbage encodings?
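For concreteness, the behavior I mean is roughly this kind of thing (a rough sketch using diffusers; the checkpoint names are the public OpenPose ControlNet and SD 1.5 weights, and the pose image is just a placeholder, nothing specific to this repo):

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image

# OpenPose ControlNet trained on photos of people...
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# ...still steers an anime-style generation. "pose.png" stands in for any
# rendered OpenPose skeleton image you have lying around.
pose = load_image("pose.png")

image = pipe(
    "an anime character doing a T-pose, cel shading, flat colors",
    image=pose,
    num_inference_steps=20,
).images[0]
image.save("anime_tpose.png")
```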
Replies: 1 comment
After thinking about it, I think I have a good explanation. The main question is probably: why does the UNet learn to associate human pose with anime character pose by default? Captions might also play a role in making the model learn that human pose and anime pose are the same concept. For example, the training data might contain an anime character captioned "T-pose anime character" as well as a human photo captioned "photo of Elon Musk doing a T-pose", so the model links the concept of pose in both image styles into the same neurons to save capacity, which makes it more generalized.
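A hedged way to poke at this caption hypothesis: if the text encoder already places a photo-style caption and an anime-style caption with the same pose word close together, the UNet receives similar conditioning for both styles. A quick probe (assuming the openai/clip-vit-large-patch14 checkpoint, the same CLIP family whose text encoder SD 1.x uses; the captions are just the ones from the example above):

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

captions = [
    "T-pose anime character",
    "photo of Elon Musk doing a T-pose",
    "photo of a cat sleeping on a sofa",  # unrelated control caption
]
inputs = tokenizer(captions, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)

# Cosine similarities; the expectation is that the two T-pose captions
# sit closer to each other than either does to the unrelated one.
print(emb @ emb.T)
```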
Let's say the task we are interested in is controlling human pose using OpenPose. We have to understand that the base UNet model already understands the concept of human pose; there is just no way to tell it which specific pose you want. It's as if the model is already a capable artist, but it doesn't speak the same language as you, so you can't tell it what you want.
By training the ControlNet, we are not introducing many new concepts to the model. The ControlNet is more like a translator that converts an OpenPose image into embedding residuals that the UNet already understands. It's basically just converting one language into another.
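To make the "translator emitting residuals" picture concrete, here is a minimal, schematic PyTorch sketch (not the real ControlNet code; the module names and sizes are made up for illustration) of a trainable encoder copy injecting zero-initialized residuals into a frozen base model:

```python
import torch
import torch.nn as nn

class ZeroConv(nn.Module):
    """1x1 conv initialized to zero, so the control branch starts as a no-op
    and cannot disturb what the frozen base model already knows."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)

class ToyControlledUNet(nn.Module):
    """Schematic only: a frozen 'UNet' plus a trainable copy of its encoder
    that consumes the OpenPose map and emits feature residuals."""
    def __init__(self, ch=64):
        super().__init__()
        # Base model (stands in for the pretrained SD UNet).
        self.base_encoder = nn.Conv2d(4, ch, 3, padding=1)
        self.base_decoder = nn.Conv2d(ch, 4, 3, padding=1)
        # Freeze everything created so far, i.e. the base model only.
        for p in self.parameters():
            p.requires_grad_(False)
        # Trainable control branch: hint encoder + encoder copy + zero conv.
        self.ctrl_hint = nn.Conv2d(3, ch, 3, padding=1)   # encodes the OpenPose image
        self.ctrl_encoder = nn.Conv2d(4, ch, 3, padding=1)
        self.zero_conv = ZeroConv(ch)

    def forward(self, noisy_latent, pose_image):
        h = self.base_encoder(noisy_latent)
        # The "translation": pose image -> residual in the UNet's feature language.
        ctrl = self.ctrl_encoder(noisy_latent) + self.ctrl_hint(pose_image)
        h = h + self.zero_conv(ctrl)   # residual added to the frozen features
        return self.base_decoder(h)

# Usage: predict noise for a latent conditioned on an OpenPose map.
model = ToyControlledUNet()
latent = torch.randn(1, 4, 64, 64)
pose = torch.randn(1, 3, 64, 64)   # would be the rendered OpenPose skeleton
print(model(latent, pose).shape)   # torch.Size([1, 4, 64, 64])
```

Because the zero convs make the branch a no-op at the start of training, the frozen UNet's knowledge (including whatever it already knows about anime-style humans) stays intact, and the ControlNet only has to learn how to express "this pose" in the UNet's own feature language.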