
Enable dynamic resolution input for Beit #31053

Merged: 5 commits, Jun 6, 2024

Conversation

@OmarManzoor (Contributor) commented on May 27, 2024

What does this PR do?

Towards #30579
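
For context, here is the behavior this PR enables, sketched as a usage example against the API as merged (the checkpoint name and shapes are illustrative):

```python
import torch
from transformers import BeitModel

model = BeitModel.from_pretrained("microsoft/beit-base-patch16-224")

# An input at a higher resolution than the 224x224 used in pretraining
pixel_values = torch.randn(1, 3, 480, 480)

# With interpolate_pos_encoding=True, the pretrained position encodings
# (and the relative position bias) are resized to match the new patch grid
outputs = model(pixel_values, interpolate_pos_encoding=True)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 901, 768]): 1 CLS token + 30*30 patches
```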


Who can review?

CC: @amyeroberts
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@OmarManzoor (Contributor, Author) commented on May 27, 2024

Hi @amyeroberts

This PR is incomplete right now because I am unsure how to proceed. It seems that BeitEncoder takes the original patch embedding configuration as an input, and hence the original window size:
https://github.com/huggingface/transformers/pull/31053/files#diff-3f84bebd6be8d9c0f5c5068199f5c49eac8489d5fa466fb6fa08b0365e78dba4R679-R685
This window size is used to initialize BeitRelativePositionBias, and the dimensions mismatch when we interpolate the embeddings during the forward pass. Could you guide me on what I am missing?
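
For reference, a minimal sketch of the ViT-style position-embedding interpolation that the forward pass applies (function name and signature are illustrative, not the exact code in this PR):

```python
import math
import torch
import torch.nn.functional as F

def interpolate_pos_encoding(pos_embed: torch.Tensor, height: int, width: int, patch_size: int) -> torch.Tensor:
    """Resize pretrained position embeddings to a new input resolution.

    pos_embed has shape (1, 1 + num_patches, dim); index 0 is the [CLS] embedding.
    """
    num_patches = pos_embed.shape[1] - 1
    new_h, new_w = height // patch_size, width // patch_size
    if new_h * new_w == num_patches:
        return pos_embed  # same patch grid as pretraining, nothing to do

    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    grid = int(math.sqrt(num_patches))  # assumes a square pretraining grid

    # (1, N, dim) -> (1, dim, grid, grid) so F.interpolate can resize the grid
    patch_pos = patch_pos.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_h, new_w), mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)
```

The mismatch described above arises because this only resizes the absolute position embeddings, while BeitRelativePositionBias is still built from the original window size.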

@OmarManzoor changed the title from "Initial attempt" to "Enable dynamic resolution input for Beit" on May 28, 2024
@amyeroberts (Collaborator) left a comment:

Thanks for adding! Just a few small comments.

Review thread on tests/models/beit/test_modeling_beit.py (outdated, resolved)
Comment on lines +561 to +562:

```python
with self.assertRaises(ValueError, msg="doesn't match model"):
    model(pixel_values, interpolate_pos_encoding=False)
```
@amyeroberts (Collaborator):

Just to make sure this still holds if anything happens upstream, and to make things explicit, could you add the following above:

```python
self.assertFalse(processor.do_center_crop)
```
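
A sketch of how the suggested assertion could sit in the test (the surrounding test body is illustrative, not the exact code in this PR):

```python
import torch
from transformers import BeitImageProcessor, BeitModel

def test_inference_interpolate_pos_encoding(self):  # hypothetical test method
    processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
    # Make the no-center-crop assumption explicit, so the check below
    # still holds if the processor's defaults ever change upstream
    self.assertFalse(processor.do_center_crop)

    model = BeitModel.from_pretrained("microsoft/beit-base-patch16-224")
    pixel_values = torch.randn(1, 3, 480, 480)  # larger than the model's 224x224

    # Without interpolation, the mismatched patch grid should raise
    with self.assertRaises(ValueError, msg="doesn't match model"):
        model(pixel_values, interpolate_pos_encoding=False)
```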

Review threads on src/transformers/models/beit/modeling_beit.py (two outdated and resolved; one resolved)
@amyeroberts (Collaborator) commented:

@OmarManzoor Regarding the relative position bias: looking at the modeling code, I think whenever the output of this model is used, it will also need to be interpolated if interpolate_pos_encoding=True.

@OmarManzoor (Contributor, Author) commented on Jun 3, 2024:

> @OmarManzoor Regarding the relative position bias: looking at the modeling code, I think whenever the output of this model is used, it will also need to be interpolated if interpolate_pos_encoding=True.

Could you clarify where exactly this should be added? Do we need to add a new interpolation function that works for BeitSelfAttention?

@amyeroberts (Collaborator) commented:

@OmarManzoor You need to make sure that the relative position biases are interpolated wherever that is needed. This might be as an argument to the relative position class, or within the modules that use its output.

@OmarManzoor (Contributor, Author) commented:

@amyeroberts How do I calculate the interpolations for the relative position biases, similar to how we calculated them for the embeddings?
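
A common answer, also used in the original BEiT fine-tuning code, is to treat the learned bias table as a 2D grid over relative offsets and resize that grid bicubically, just as the patch position embeddings are resized. A minimal sketch with hypothetical names (BEiT's three extra class-token entries in the table are left out here):

```python
import torch
import torch.nn.functional as F

def interpolate_relative_position_bias_table(
    bias_table: torch.Tensor,  # ((2*H - 1) * (2*W - 1), num_heads)
    old_window: tuple,         # patch grid at pretraining resolution, e.g. (14, 14)
    new_window: tuple,         # patch grid at the new input resolution
) -> torch.Tensor:
    """Resize the learned bias table to cover the new range of relative offsets."""
    num_heads = bias_table.shape[1]
    old_h, old_w = 2 * old_window[0] - 1, 2 * old_window[1] - 1
    new_h, new_w = 2 * new_window[0] - 1, 2 * new_window[1] - 1

    # View the flat table as one 2D offset grid per head: (1, num_heads, old_h, old_w)
    table_2d = bias_table.T.reshape(1, num_heads, old_h, old_w)

    # Bicubic resize to the new offset grid, mirroring the patch-embedding case
    resized = F.interpolate(table_2d, size=(new_h, new_w), mode="bicubic", align_corners=False)

    # Flatten back to ((2*H' - 1) * (2*W' - 1), num_heads)
    return resized.reshape(num_heads, new_h * new_w).T
```

The relative position index then has to be recomputed for the new window so it addresses the resized table.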

@OmarManzoor marked this pull request as ready for review on Jun 4, 2024.
@amyeroberts (Collaborator) left a comment:

Looks great - thanks for adding this! Only a small nit on the docstring for data2vec.

Comment on a diff hunk:

```diff
@@ -670,6 +753,7 @@ def forward(
     head_mask: Optional[torch.Tensor] = None,
     output_attentions: Optional[bool] = None,
     output_hidden_states: Optional[bool] = None,
+    interpolate_pos_encoding: bool = False,
```
@amyeroberts (Collaborator):

Should be added to DATA2VEC_VISION_INPUTS_DOCSTRING.
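
A sketch of the docstring entry, following the wording transformers uses for this argument in other vision models (the variable name below just marks the fragment; the real change extends the existing docstring):

```python
# Fragment to add under DATA2VEC_VISION_INPUTS_DOCSTRING's Args section:
INTERPOLATE_POS_ENCODING_DOC = r"""
        interpolate_pos_encoding (`bool`, *optional*, defaults to `False`):
            Whether to interpolate the pre-trained position encodings.
"""
```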

@amyeroberts (Collaborator) commented:

Thanks again for adding this!

@amyeroberts merged commit 6811839 into huggingface:main on Jun 6, 2024 (18 checks passed).
@OmarManzoor deleted the dynamic_resolution_beit branch on Jun 7, 2024.
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request on Jun 11, 2024, with the following commit messages:

* Initial attempt
* Updates: PR suggestions
* Interpolate the relative position bias when interpolate_pos_encoding is True
* Add slow tag for the added tests
* Add in DATA2VEC_VISION_INPUTS_DOCSTRING
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request on Jun 14, 2024 (same commit messages as above).