Replies: 1 comment
-
Why go via the text interface? Why not skip that and supply an embedding of the augmentation directly, splicing it into the text embedding to create an augmented embedding? The reason I say this is that the transformation we have in mind already begins its life as a vector; why turn it into text just to have T5 attempt to turn it back into a vector? One thing I've been wondering for a couple of weeks is "how can we efficiently teach Imagen a deep understanding of poses", and I wonder whether you could start with the pose keypoints, normalize them to a standard translation and rotation in 3D space, and separate that translation, rotation, and normalized pose into three vectors that you pass into the embedding layer. That way it would learn a symbol for "sitting" that doesn't require the subject to be facing any particular angle; in other words, an understanding of poses that is equivariant w.r.t. affine transformations. I'm not sure whether locality matters, but if you splice the pose embedding into the right location in the text embedding, you could even use it as an adjective.
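To make the splicing idea concrete, here's a rough sketch in PyTorch. All names (`PoseConditioner`, `text_dim`, the keypoint/quaternion shapes) are made up for illustration, and it assumes the pose has already been normalized to a canonical frame with the global translation and rotation factored out:

```python
import torch
import torch.nn as nn

class PoseConditioner(nn.Module):
    """Hypothetical sketch: project the normalized pose (plus the separated
    translation and rotation) into the same space as the T5 text embeddings,
    then splice the resulting tokens into the text sequence."""
    def __init__(self, num_keypoints: int = 17, text_dim: int = 1024):
        super().__init__()
        self.pose_proj = nn.Linear(num_keypoints * 3, text_dim)  # normalized pose
        self.trans_proj = nn.Linear(3, text_dim)                  # global translation
        self.rot_proj = nn.Linear(4, text_dim)                    # rotation as a quaternion

    def forward(self, text_emb, pose, translation, rotation, position: int):
        # text_emb:    (batch, seq_len, text_dim) from the frozen T5 encoder
        # pose:        (batch, num_keypoints, 3), already in a canonical frame
        # translation: (batch, 3), rotation: (batch, 4)
        pose_tok = self.pose_proj(pose.flatten(1)).unsqueeze(1)
        trans_tok = self.trans_proj(translation).unsqueeze(1)
        rot_tok = self.rot_proj(rotation).unsqueeze(1)
        # Splice the three conditioning tokens into the text sequence at `position`,
        # e.g. right before the noun they modify, so they behave like an adjective.
        return torch.cat(
            [text_emb[:, :position], pose_tok, trans_tok, rot_tok, text_emb[:, position:]],
            dim=1,
        )
```

The `position` argument is what I mean by locality: you could drop the pose tokens next to the word they should modify rather than at a fixed slot.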
-
There's been some great work over the years showing how beneficial image augmentation is for various tasks and architectures, and even a bit of recent work applying it to diffusion-based models (https://arxiv.org/pdf/2206.00364.pdf). What I haven't seen is text augmentation (many text descriptions that map to a single image), though it undoubtedly exists somewhere.
It seems to me that it's a great way to artificially grow your dataset for training text-to-image models, as well as to reduce epistemic uncertainty.
For instance, given an image with N corresponding text descriptions, we could randomly sample one of them at train time, apply an augmentation to the image, append the augmentation in words (e.g. 'the image is rotated X degrees'), and use the newly formed text-image pair as a training example.
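As a rough sketch of what I mean (hypothetical helper names, only rotation shown, using PIL):

```python
import random
from PIL import Image

# Hypothetical train-time sketch: pair a randomly sampled caption with a random
# image rotation and describe that rotation in words, so one image can yield
# many distinct (text, image) training examples.
ROTATIONS = [0, 90, 180, 270]

def augmented_example(image: Image.Image, captions: list[str]) -> tuple[str, Image.Image]:
    caption = random.choice(captions)          # one of the N descriptions
    degrees = random.choice(ROTATIONS)
    augmented = image.rotate(degrees, expand=True)
    if degrees:
        caption = f"{caption} The image is rotated {degrees} degrees."
    return caption, augmented
```

The same pattern would presumably extend to flips, crops, or color jitter, as long as the augmentation can be described in a short phrase.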
Thoughts?