
Added interpolation for OWL-ViT & OWLv2 #34268

Open
wants to merge 5 commits into main

Conversation

kshitij0-0

What does this PR do?

Fixes # (issue)

Towards: Community contribution: enable dynamic resolution input for more vision models. #30579
Added interpolation for dynamic resolution inputs.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts please review this; it was the only task that still needed a check in the box :)

@LysandreJik
Member

cc @qubvel @molbap @yonigozlan

@molbap molbap self-requested a review October 23, 2024 12:19
Contributor

@molbap molbap left a comment


Hi @kshitij0-0! I took a quick look; the interpolation here isn't aligned with what we usually do, see my comment below. You can take a look at how interpolation of position encodings is handled in modeling_clip.py, typically with an interpolate_pos_encoding method. Feel free to ping me again when you're done with this!
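For context, the pattern referenced above resizes the learned patch position embeddings to the new input grid rather than resizing the image itself. The following is a minimal sketch of that idea (simplified names and a standalone function, not the exact Transformers implementation):

```python
import torch
import torch.nn as nn


def interpolate_pos_encoding(
    pos_embed: torch.Tensor,  # (1, 1 + num_positions, dim), class token first
    height: int,
    width: int,
    patch_size: int,
) -> torch.Tensor:
    """Resize patch position embeddings to a new grid size via bicubic
    interpolation, leaving the class-token embedding untouched (the same
    idea as the interpolate_pos_encoding methods in modeling_clip.py)."""
    num_positions = pos_embed.shape[1] - 1
    dim = pos_embed.shape[-1]
    class_pos = pos_embed[:, :1]   # keep as-is
    patch_pos = pos_embed[:, 1:]

    # original square grid of the pretrained checkpoint
    grid = int(num_positions**0.5)
    # target grid implied by the new image size
    new_h, new_w = height // patch_size, width // patch_size

    # (1, N, dim) -> (1, dim, grid, grid) so we can interpolate spatially
    patch_pos = patch_pos.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)
    patch_pos = nn.functional.interpolate(
        patch_pos, size=(new_h, new_w), mode="bicubic", align_corners=False
    )
    # back to a flat sequence: (1, new_h * new_w, dim)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
    return torch.cat([class_pos, patch_pos], dim=1)
```

This way the patch embeddings computed from the full-resolution image stay aligned with their position encodings, instead of downsampling the image to the checkpoint's training resolution.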

Comment on lines +298 to +307
if interpolate_pos_encoding:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        pixel_values = torch.nn.functional.interpolate(
            pixel_values, size=(target_size, target_size), mode="bilinear", align_corners=False
        )
else:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        raise ValueError(
            f"Input image size ({pixel_values.shape[2]}*{pixel_values.shape[3]}) doesn't match model ({target_size}*{target_size})."
        )
Contributor


I'm a bit confused: here the patches are directly interpolated but the positional encodings are untouched. This will cause a misalignment between patches and positional encodings and degrade the results, unless there's something I'm missing?

Author


Hi @molbap, thanks for taking the time to review this. I've gone through modeling_clip.py and will adopt the interpolate_pos_encoding method from it.

Contributor


A rebase on main should get rid of this!

Comment on lines 291 to 301

if interpolate_pos_encoding:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        pixel_values = torch.nn.functional.interpolate(
            pixel_values, size=(target_size, target_size), mode="bilinear", align_corners=False
        )
else:
    if pixel_values.shape[2] != target_size or pixel_values.shape[3] != target_size:
        raise ValueError(
            f"Input image size ({pixel_values.shape[2]}*{pixel_values.shape[3]}) doesn't match model ({target_size}*{target_size})."
        )
Contributor


same comment

@kshitij0-0
Author

Hi @molbap, so far I've pushed changes only for OWL-ViT; if they look good, I'll replicate them for OWLv2 as well.

Labels: None yet
Projects: None yet
3 participants