
Added interpolation for OWL-ViT & OWLv2 #34268

Open
wants to merge 5 commits into main

Conversation

kshitij0-0

What does this PR do?

Fixes # (issue)

Towards: Community contribution: enable dynamic resolution input for more vision models. #30579
Added interpolation for dynamic resolution inputs.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts please review this; it was the only task that still needed a check in the box :)

@LysandreJik
Member

cc @qubvel @molbap @yonigozlan

@molbap molbap self-requested a review October 23, 2024 12:19
Contributor

@molbap molbap left a comment


Hi @kshitij0-0! I took a quick look; the interpolation here isn't aligned with what we usually do, see my comment below. You can take a look at how interpolation of position encodings is handled in modeling_clip.py, typically with an interpolate_pos_encoding method. Feel free to ping me again when you're done with this!
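For context, the pattern referenced above resizes the learned patch position embeddings to the new input grid rather than resizing the image itself. The following is a minimal sketch of that idea (simplified names and a standalone function, not the exact Transformers implementation):

```python
import torch
import torch.nn as nn


def interpolate_pos_encoding(
    pos_embed: torch.Tensor,  # (1, 1 + num_positions, dim), class token first
    height: int,
    width: int,
    patch_size: int,
) -> torch.Tensor:
    """Resize patch position embeddings to a new grid size via bicubic
    interpolation, leaving the class-token embedding untouched (the same
    idea as the interpolate_pos_encoding methods in modeling_clip.py)."""
    num_positions = pos_embed.shape[1] - 1
    dim = pos_embed.shape[-1]
    class_pos = pos_embed[:, :1]   # keep as-is
    patch_pos = pos_embed[:, 1:]

    # original square grid of the pretrained checkpoint
    grid = int(num_positions**0.5)
    # target grid implied by the new image size
    new_h, new_w = height // patch_size, width // patch_size

    # (1, N, dim) -> (1, dim, grid, grid) so we can interpolate spatially
    patch_pos = patch_pos.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)
    patch_pos = nn.functional.interpolate(
        patch_pos, size=(new_h, new_w), mode="bicubic", align_corners=False
    )
    # back to a flat sequence: (1, new_h * new_w, dim)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
    return torch.cat([class_pos, patch_pos], dim=1)
```

This way the patch embeddings computed from the full-resolution image stay aligned with their position encodings, instead of downsampling the image to the checkpoint's training resolution.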

Comment on lines +298 to +307
if interpolate_pos_encoding:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        pixel_values = torch.nn.functional.interpolate(
            pixel_values, size=(target_size, target_size), mode="bilinear", align_corners=False
        )
else:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        raise ValueError(
            f"Input image size ({pixel_values.shape[2]}*{pixel_values.shape[3]}) doesn't match model ({target_size}*{target_size})."
        )
Contributor


I'm a bit confused: here the patches are directly interpolated but the positional encodings are untouched. This will cause a misalignment between patches and positional encodings and degrade the results, unless there's something I'm missing?

Author


Hi @molbap, thanks for taking the time to review this. I've gone through modeling_clip.py and will adopt the interpolate_pos_encoding method from it.

Contributor


A rebase on main should get rid of this!

Comment on lines 291 to 301

if interpolate_pos_encoding:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        pixel_values = torch.nn.functional.interpolate(
            pixel_values, size=(target_size, target_size), mode="bilinear", align_corners=False
        )
else:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        raise ValueError(
            f"Input image size ({pixel_values.shape[2]}*{pixel_values.shape[3]}) doesn't match model ({target_size}*{target_size})."
        )
Contributor


same comment

@kshitij0-0
Author

Hi @molbap, so far I've pushed changes only for OWL-ViT; if they look good, I'll replicate them for OWLv2 as well.

Labels: None yet
Projects: None yet
3 participants