This question sounds simple, and on some level I do feel like it should generalize, but I don't feel it deeply in my bones, and that annoys me. Even OpenPose itself can't understand anime images, yet now you can synthesize completely new anime images in arbitrary poses and use them to train OpenPose itself. So the capability of the trained ControlNet far surpasses OpenPose itself. What is the main ingredient that makes it generalize beyond the domain it was trained on?
I guess the 2nd one is not a requirement for generalization, because #188 already tried training smaller ControlNet models from scratch, without basing them on the original UNet, and it still works (I'm not fully convinced though; someone needs to train an OpenPose ControlNet-Lite and try it on anime images). But it's still not clear why 1. makes it generalize. Note that the ControlNet doesn't need to see any anime images at all during its training phase. So how is it possible that it learns to deal with anime images anyway? How does the model learn that human pose and anime character pose are the same concept? Why does it need to learn that at all? Why doesn't it cheat and learn only human pose? When fed anime prompts or anime noisy latents, why doesn't it just give you garbage encodings?
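For concreteness, the behavior I mean is roughly this kind of thing (a rough sketch using diffusers; the checkpoint names are the public OpenPose ControlNet and SD 1.5 weights, and the pose image is just a placeholder, nothing specific to this repo):

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image

# OpenPose ControlNet trained on photos of people...
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# ...still steers an anime-style generation. "pose.png" stands in for any
# rendered OpenPose skeleton image you have lying around.
pose = load_image("pose.png")

image = pipe(
    "an anime character doing a T-pose, cel shading, flat colors",
    image=pose,
    num_inference_steps=20,
).images[0]
image.save("anime_tpose.png")
```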
Replies: 1 comment
After thinking about it, I think I have a good explanation. The main question is probably: why does the UNet learn to associate human pose with anime character pose by default? Captions might also play a role in making the model learn that human pose and anime pose are the same concept. For example, the training data might contain an anime character captioned "T-pose anime character" as well as a human photo captioned "photo of Elon Musk doing a T-pose", so the model links the concept of pose in both image styles into the same neurons to save capacity, which makes it more generalized.
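A hedged way to poke at this caption hypothesis: if the text encoder already places a photo-style caption and an anime-style caption with the same pose word close together, the UNet receives similar conditioning for both styles. A quick probe (assuming the openai/clip-vit-large-patch14 checkpoint, the same CLIP family whose text encoder SD 1.x uses; the captions are just the ones from the example above):

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

captions = [
    "T-pose anime character",
    "photo of Elon Musk doing a T-pose",
    "photo of a cat sleeping on a sofa",  # unrelated control caption
]
inputs = tokenizer(captions, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)

# Cosine similarities; the expectation is that the two T-pose captions
# sit closer to each other than either does to the unrelated one.
print(emb @ emb.T)
```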
Let's say the task we are interested in is controlling human pose using OpenPose. We have to understand that the base UNet model already understands the concept of human pose; there is just no way to tell it which specific pose you want. It's as if the model is already a capable artist, but it doesn't speak the same language as you, so you can't tell it what you want.
By training the ControlNet, we are not introducing many new concepts to the model. The ControlNet is more like a translator that converts an OpenPose image into embedding residuals that the UNet already understands. It's basically just converting one language into another.
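To make the "translator emitting residuals" picture concrete, here is a minimal, schematic PyTorch sketch (not the real ControlNet code; the module names and sizes are made up for illustration) of a trainable encoder copy injecting zero-initialized residuals into a frozen base model:

```python
import torch
import torch.nn as nn

class ZeroConv(nn.Module):
    """1x1 conv initialized to zero, so the control branch starts as a no-op
    and cannot disturb what the frozen base model already knows."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)

class ToyControlledUNet(nn.Module):
    """Schematic only: a frozen 'UNet' plus a trainable copy of its encoder
    that consumes the OpenPose map and emits feature residuals."""
    def __init__(self, ch=64):
        super().__init__()
        # Base model (stands in for the pretrained SD UNet).
        self.base_encoder = nn.Conv2d(4, ch, 3, padding=1)
        self.base_decoder = nn.Conv2d(ch, 4, 3, padding=1)
        # Freeze everything created so far, i.e. the base model only.
        for p in self.parameters():
            p.requires_grad_(False)
        # Trainable control branch: hint encoder + encoder copy + zero conv.
        self.ctrl_hint = nn.Conv2d(3, ch, 3, padding=1)   # encodes the OpenPose image
        self.ctrl_encoder = nn.Conv2d(4, ch, 3, padding=1)
        self.zero_conv = ZeroConv(ch)

    def forward(self, noisy_latent, pose_image):
        h = self.base_encoder(noisy_latent)
        # The "translation": pose image -> residual in the UNet's feature language.
        ctrl = self.ctrl_encoder(noisy_latent) + self.ctrl_hint(pose_image)
        h = h + self.zero_conv(ctrl)   # residual added to the frozen features
        return self.base_decoder(h)

# Usage: predict noise for a latent conditioned on an OpenPose map.
model = ToyControlledUNet()
latent = torch.randn(1, 4, 64, 64)
pose = torch.randn(1, 3, 64, 64)   # would be the rendered OpenPose skeleton
print(model(latent, pose).shape)   # torch.Size([1, 4, 64, 64])
```

Because the zero convs make the branch a no-op at the start of training, the frozen UNet's knowledge (including whatever it already knows about anime-style humans) stays intact, and the ControlNet only has to learn how to express "this pose" in the UNet's own feature language.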