Enable dynamic resolution input for Swin Transformer and variants #30656

the-neural-networker · 2024-05-05T04:05:14Z

What does this PR do?

This PR adds the interpolate_pos_encoding feature to the Swin Transformer model, allowing it to handle input images of different resolutions while leveraging existing pre-trained checkpoints.

Addresses #30579.

Changes

The following changes have been made to enable dynamic resolution input for the Swin Transformer model:

Added the interpolate_pos_encoding method to the SwinModel class. This method takes the pre-trained position embeddings and the target height and width as inputs and returns the interpolated position embeddings.
Modified the forward method of the SwinModel class to accept an interpolate_pos_encoding argument. When set to True, the model will interpolate the position embeddings based on the input image size.
Added a test case in test_modeling_swin.py to verify that the model can correctly interpolate position embeddings for input images of different sizes.
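For readers unfamiliar with the technique, here is a minimal, dependency-free sketch of what interpolating position embeddings means. It is not the code from this PR: the function name, the list-of-lists layout, and the use of bilinear interpolation are all assumptions made for illustration, and a real implementation would operate on tensors instead.

```python
from typing import List, Tuple

Vector = List[float]


def interpolate_pos_embeddings(
    pos_embed: List[Vector],
    old_grid: Tuple[int, int],
    new_grid: Tuple[int, int],
) -> List[Vector]:
    """Bilinearly resize a row-major (old_h * old_w, dim) grid of embeddings.

    Hypothetical helper for illustration only; not the library's method.
    """
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    if len(pos_embed) != old_h * old_w:
        raise ValueError("pos_embed length must equal old_h * old_w")
    if (old_h, old_w) == (new_h, new_w):
        return [list(vec) for vec in pos_embed]  # nothing to interpolate

    def source_coord(index: int, old: int, new: int) -> Tuple[int, int, float]:
        # Map the centre of output cell `index` back onto the source axis,
        # clamped so that border cells reuse the border embeddings.
        pos = min(max((index + 0.5) * old / new - 0.5, 0.0), old - 1.0)
        low = int(pos)
        high = min(low + 1, old - 1)
        return low, high, pos - low

    def lerp(a: Vector, b: Vector, weight: float) -> Vector:
        return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]

    resized = []
    for row in range(new_h):
        y0, y1, wy = source_coord(row, old_h, new_h)
        for col in range(new_w):
            x0, x1, wx = source_coord(col, old_w, new_w)
            top = lerp(pos_embed[y0 * old_w + x0], pos_embed[y0 * old_w + x1], wx)
            bottom = lerp(pos_embed[y1 * old_w + x0], pos_embed[y1 * old_w + x1], wx)
            resized.append(lerp(top, bottom, wy))
    return resized


# A 2x2 grid of 1-dimensional embeddings resized to a 4x4 grid.
small = [[0.0], [1.0], [2.0], [3.0]]
large = interpolate_pos_embeddings(small, (2, 2), (4, 4))
print(len(large))  # 16
```

The point of the sketch is only that the pre-trained grid is resampled to match the patch grid of the new input, so the same checkpoint can be reused at a different resolution.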

Who can review?

@amyeroberts

the-neural-networker · 2024-05-05T04:07:15Z

Any suggestions are more than welcome! This is my first open-source contribution!

amyeroberts

Looks great - thanks for adding this!

For the swin implementation, just a couple of small comments.

For the quality checks, could you:

Run make fix-copies
Add equivalent tests to the models which are updated with the fix-copies run and add any necessary changes to their modeling files so the feature is enabled there too?

tests/models/swin/test_modeling_swin.py

src/transformers/models/swin/modeling_swin.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

the-neural-networker · 2024-05-09T21:41:03Z

Hi @amyeroberts, thank you! I've committed your suggestions. I will add the feature to the copies of Swin by running make fix-copies, and I will have the update ready by the weekend.

NielsRogge · 2024-05-10T10:44:33Z

Hi,

I think Swin Transformer already works on any resolution.

the-neural-networker · 2024-05-10T20:10:28Z

Hi @NielsRogge, yes, it looks like dynamic resolution is already supported by Swin through the maybe_pad method. Correct me if I am wrong, but it looks like the pixel values are padded to achieve dynamic resolution, whereas in ViT the position encodings are interpolated to achieve the same. What should the next step be, @amyeroberts? Do we still want to add interpolate_pos_encoding to Swin? I think I can still write the tests for dynamic resolution input.
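To make the padding idea concrete, here is a small sketch of the arithmetic involved. It assumes that padding rounds the height and width up to the next multiple of the patch size; the helper name and the patch size of 4 are illustrative, and the library's maybe_pad is not reproduced here.

```python
from typing import Tuple


def padded_size(height: int, width: int, patch_size: Tuple[int, int]) -> Tuple[int, int]:
    """Return (height, width) after rounding each up to a multiple of the patch size.

    Hypothetical helper: it only computes the resulting shape, it pads nothing.
    """
    patch_h, patch_w = patch_size
    pad_h = (-height) % patch_h  # rows of padding; 0 when already divisible
    pad_w = (-width) % patch_w   # columns of padding; 0 when already divisible
    return height + pad_h, width + pad_w


print(padded_size(224, 224, (4, 4)))  # (224, 224)
print(padded_size(225, 250, (4, 4)))  # (228, 252)
```

Padding changes the input to fit the model, while interpolation changes the position embeddings to fit the input, which is the distinction drawn in the comment above.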

NielsRogge · 2024-05-11T12:11:17Z

@the-neural-networker I checked it, so currently any height and width are supported if divisible by 32. I assume you can continue with supporting interpolation of position embeddings, since that would allow any resolution (needs to be tested of course).

tests/models/swin/test_modeling_swin.py

the-neural-networker · 2024-05-13T19:44:40Z

@amyeroberts, @NielsRogge I've been working on implementing the Swin transformer variants and writing their corresponding tests. I noticed that for MaskFormerSwin, there doesn't seem to be a pretrained checkpoint available, and it appears that an integration test hasn't been written for it yet (similar to DonutSwin).

Since MaskFormerSwin is primarily used as a backbone, I'm wondering about the best approach for writing its integration test. Currently, the integration tests for ViT and other models utilize pretrained checkpoints and their associated image processors to test the interpolation functionality.

Given the lack of a pretrained checkpoint for MaskFormerSwin, could you provide some guidance on how to properly test its integration?

See test_modeling_maskformer_swin.py.

amyeroberts · 2024-05-15T15:08:18Z

@the-neural-networker In this case, when there are no existing checkpoints or integration tests, it's OK to skip adding tests for the interpolate method.

amyeroberts

Looks great - thanks for adding this and improving the library's models!

Just some nits on defaulting to False.

Running make fix-copies should resolve the code consistency checks
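As an illustration of an opt-in flag that defaults to False, here is a hypothetical sketch. The helper name, its parameters, and the choice to raise on a size mismatch are assumptions for illustration and are not taken from the modeling files.

```python
from typing import Tuple


def resolve_patch_grid(
    image_size: Tuple[int, int],
    pretrained_size: Tuple[int, int],
    patch_size: int = 4,
    interpolate_pos_encoding: bool = False,  # opt-in, so existing callers see no change
) -> Tuple[int, int]:
    """Return the patch grid for an input, rejecting unexpected sizes by default.

    Hypothetical helper; raising on mismatch is an assumption of this sketch.
    """
    if tuple(image_size) != tuple(pretrained_size) and not interpolate_pos_encoding:
        raise ValueError(
            f"Input size {tuple(image_size)} does not match {tuple(pretrained_size)}; "
            "pass interpolate_pos_encoding=True to resize the position embeddings."
        )
    height, width = image_size
    return height // patch_size, width // patch_size


print(resolve_patch_grid((224, 224), (224, 224)))  # (56, 56)
print(resolve_patch_grid((256, 384), (224, 224), interpolate_pos_encoding=True))  # (64, 96)
```

With False as the default, code that never passes the argument behaves as it did before the feature existed.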

src/transformers/models/swin/modeling_swin.py

src/transformers/models/swinv2/modeling_swinv2.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

the-neural-networker · 2024-05-16T18:05:52Z

Hi @amyeroberts,

To clarify, should I:

1. Add the interpolate method to DonutSwin and MaskFormerSwin, but omit the tests? Or...
2. Leave DonutSwin and MaskFormerSwin unchanged?

amyeroberts · 2024-05-16T18:09:14Z

@the-neural-networker You can do either. I'd suggest option 2, as the feature is unlikely to be used with backbones, and we can add it later if it's requested. If you find that you need to add it because of # Copied from comments, then you can add the method without adding integration tests for those models.

the-neural-networker · 2024-05-16T18:14:12Z

Thank you for the clarification!

the-neural-networker · 2024-05-16T19:32:26Z

It looks like interpolation needs to be added to DonutSwin and MaskFormerSwin because of the # Copied from comments. So I will add it there, but omit their tests.
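For context on why a # Copied from comment forces the change, here is a toy parser for such a marker. It is a sketch only: the marker string is an illustrative example of the convention, the regular expression is an assumption, and the repository's actual consistency checker is not reproduced here.

```python
import re
from typing import Dict, Optional, Tuple

# Matches "# Copied from <dotted.path>" with an optional "with Old->New, ..." suffix.
COPIED_FROM = re.compile(
    r"#\s*Copied from\s+(?P<source>[\w.]+)(?:\s+with\s+(?P<rules>.+))?"
)


def parse_copied_from(line: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (source path, rename rules) for a marker line, or None if absent."""
    match = COPIED_FROM.search(line)
    if match is None:
        return None
    replacements: Dict[str, str] = {}
    for rule in (match.group("rules") or "").split(","):
        if "->" in rule:
            old, new = rule.split("->", 1)
            replacements[old.strip()] = new.strip()
    return match.group("source"), replacements


# Illustrative marker, written in the style used by the copies of Swin.
marker = "# Copied from transformers.models.swin.modeling_swin.SwinEmbeddings with Swin->DonutSwin"
print(parse_copied_from(marker))
```

Because a marked class is expected to stay identical to its source after the renames are applied, a change to the Swin source propagates to every marked copy when make fix-copies runs.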

amyeroberts

Thanks for adding this feature and for being so thoughtful about testing!

Are there any other changes to be pushed? Otherwise I think we're good to merge 🤗

the-neural-networker · 2024-05-17T16:44:47Z

Thank you for the thorough review and kind words! I think those are all the changes, so feel free to merge whenever you are ready!

…0656)
* add interpolation of positional encoding support to swin
* add style changes
* use default image processor and make size a dictionary
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* remove logits testing
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Refactor image size validation logic when interpolation is disabled
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* remove asserts in modeling
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* add dynamic resolution input support to swinv2
* change size to ensure interpolation encoding path is triggered
* set interpolate_pos_encoding default value to False
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* set interpolate_pos_encoding default value to False
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* set interpolate_pos_encoding default value to False
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* set interpolate_pos_encoding default value to False
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* set interpolate_pos_encoding default value to False
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* set interpolate_pos_encoding default value to False
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* set interpolate_pos_encoding default value to False
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* set interpolate_pos_encoding default value to False
* add dynamic resolution input to donut swin
* add dynamic resolution input to maskformer swin
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

the-neural-networker added 2 commits May 4, 2024 23:39

add interpolation of positional encoding support to swin

05c82cf

add style changes

3b7947a

amyeroberts reviewed May 7, 2024

amyeroberts mentioned this pull request May 8, 2024

Community contribution: enable dynamic resolution input for more vision models. #30579

Open

11 tasks

the-neural-networker and others added 4 commits May 9, 2024 17:15

use default image processor and make size a dictionary

6f30f50

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

remove logits testing

efbae76

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

Refactor image size validation logic when interpolation is disabled

8fee924

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

remove asserts in modeling

6b910f6

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

amyeroberts reviewed May 13, 2024

add dynamic resolution input support to swinv2

5825838

the-neural-networker changed the title ~~Enable dynamic resolution input for Swin Transformer~~ Enable dynamic resolution input for Swin Transformer and variants May 13, 2024

change size to ensure interpolation encoding path is triggered

a480cda

amyeroberts approved these changes May 15, 2024

the-neural-networker and others added 7 commits May 15, 2024 14:30

set interpolate_pos_encoding default value to False

30cca49

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

set interpolate_pos_encoding default value to False

00f4830

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

set interpolate_pos_encoding default value to False

55d5602

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

set interpolate_pos_encoding default value to False

5a0834e

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

set interpolate_pos_encoding default value to False

553adf4

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

set interpolate_pos_encoding default value to False

9444611

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

set interpolate_pos_encoding default value to False

76bea6c

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

set interpolate_pos_encoding default value to False

d7f6f96

the-neural-networker added 2 commits May 16, 2024 15:32

add dynamic resolution input to donut swin

895a606

add dynamic resolution input to maskformer swin

6dbf7e3

amyeroberts approved these changes May 17, 2024

amyeroberts merged commit 481a957 into huggingface:main May 17, 2024
17 checks passed

amyeroberts mentioned this pull request May 21, 2024

Fix swin embeddings interpolation #30936

Merged

M-Ali-ML mentioned this pull request May 25, 2024

Add interpolation of positional embedding to swin2sr #31024

Closed

5 tasks
