added interpolation for owlvit & owlv2. #34268
base: main
Conversation
Hi @kshitij0-0! I took a quick look, and it seems the interpolation isn't aligned with what we usually do, see my comment below. You can take a look at how interpolation of pos encodings is tackled in modeling_clip.py, typically with an interpolate_pos_encoding method. Feel free to ping me again when you're done with this!
if interpolate_pos_encoding:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        pixel_values = torch.nn.functional.interpolate(
            pixel_values, size=(target_size, target_size), mode="bilinear", align_corners=False
        )
else:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        raise ValueError(
            f"Input image size ({pixel_values.shape[2]}*{pixel_values.shape[3]}) doesn't match model ({target_size}*{target_size})."
        )
I'm a bit confused: here the patches are directly interpolated but the positional encodings are untouched. This will cause a misalignment between patches and positional encodings and degrade the results, unless there's something I'm missing?
Hi @molbap, thanks for taking the time to review this. I've gone through modeling_clip.py and will be taking the interpolate_pos_encoding method from it.
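For reference, a minimal sketch of that pattern, with illustrative names and shapes rather than the exact modeling_clip.py code: the class token embedding is kept as-is, the patch position embeddings are reshaped back into their square grid, bicubically interpolated to the new patch grid, and concatenated back together.

import torch
import torch.nn as nn


def interpolate_pos_encoding(
    position_embedding: torch.Tensor,  # (1, 1 + num_positions, dim), class token first
    height: int,
    width: int,
    patch_size: int,
) -> torch.Tensor:
    """Resize learned position embeddings to the patch grid of a new input resolution."""
    num_positions = position_embedding.shape[1] - 1
    dim = position_embedding.shape[-1]

    class_pos_embed = position_embedding[:, :1]  # leave the class token embedding untouched
    patch_pos_embed = position_embedding[:, 1:]  # (1, num_positions, dim)

    # Patch grid implied by the new image resolution.
    new_height = height // patch_size
    new_width = width // patch_size

    # Reshape the flat patch embeddings back into their original square grid ...
    grid_size = int(num_positions**0.5)
    patch_pos_embed = patch_pos_embed.reshape(1, grid_size, grid_size, dim).permute(0, 3, 1, 2)

    # ... interpolate them to the new grid ...
    patch_pos_embed = nn.functional.interpolate(
        patch_pos_embed, size=(new_height, new_width), mode="bicubic", align_corners=False
    )

    # ... and flatten back to (1, new_height * new_width, dim).
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).reshape(1, -1, dim)

    return torch.cat((class_pos_embed, patch_pos_embed), dim=1)


# Example: embeddings trained on a 24x24 grid (768px / patch 32), resized for 1024px input.
pos_embed = torch.randn(1, 1 + 24 * 24, 768)
resized = interpolate_pos_encoding(pos_embed, height=1024, width=1024, patch_size=32)
print(resized.shape)  # torch.Size([1, 1025, 768]) -> 1 class token + 32 * 32 patches

The forward pass then adds this resized tensor to the patch embeddings instead of the fixed-size position embedding, so patches and positions stay aligned at any input resolution.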
A rebase on main should get rid of this!
if interpolate_pos_encoding:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        pixel_values = torch.nn.functional.interpolate(
            pixel_values, size=(target_size, target_size), mode="bilinear", align_corners=False
        )
else:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        raise ValueError(
            f"Input image size ({pixel_values.shape[2]}*{pixel_values.shape[3]}) doesn't match model ({target_size}*{target_size})."
        )
same comment
Hi @molbap, currently I've pushed changes only with respect to OWL-ViT; if it looks good, I will replicate it on the v2 as well.
What does this PR do?
Fixes # (issue)
Towards: Community contribution: enable dynamic resolution input for more vision models. #30579
Added interpolation for dynamic resolution inputs.
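As a rough idea of how this could be used once merged (a sketch only: the interpolate_pos_encoding flag is what this PR proposes and is not part of the released OWL-ViT API, and the size override passed through the processor is an assumption):

import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Ask for a larger input than the 768x768 the checkpoint was trained with
# (whether `size` can be overridden through the processor like this is an assumption;
# it may need to be set on processor.image_processor instead).
inputs = processor(
    text=[["a photo of a cat", "a photo of a dog"]],
    images=image,
    size={"height": 1024, "width": 1024},
    return_tensors="pt",
)

with torch.no_grad():
    # `interpolate_pos_encoding` is the flag proposed in this PR, not released API.
    outputs = model(**inputs, interpolate_pos_encoding=True)

# A larger patch grid at the higher resolution means more predicted boxes per image.
print(outputs.logits.shape)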
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@amyeroberts, please review this, the only task that needed a check in the box :)