Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable dynamic resolution input for Swin Transformer and variants #30656

Merged

Conversation

the-neural-networker
Copy link
Contributor

@the-neural-networker the-neural-networker commented May 5, 2024

What does this PR do?

This PR adds the interpolate_pos_encoding feature to the Swin Transformer model, allowing it to handle input images of different resolutions while leveraging existing pre-trained checkpoints.

Addresses #30579.

Changes

The following changes have been made to enable dynamic resolution input for the Swin Transformer model:

  1. Added the interpolate_pos_encoding method to the SwinModel class. This method takes the pre-trained position embeddings and the target height and width as inputs and returns the interpolated position embeddings.

  2. Modified the forward method of the SwinModel class to accept an interpolate_pos_encoding argument. When set to True, the model will interpolate the position embeddings based on the input image size.

  3. Added a test case in test_modeling_swin.py to verify that the model can correctly interpolate position embeddings for input images of different sizes.

Who can review?

@amyeroberts

@the-neural-networker
Copy link
Contributor Author

Any suggestions are more than welcome! This is my first open-source contribution!

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - thanks for adding this!

For the swin implementation, just a couple of small comments.

For the quality checks, could you:

  • Run make fix-copies
  • Add equivalent tests to the models which are updated with the fix-copies run and add any necessary changes to their modeling files so the feature is enabled there too?

tests/models/swin/test_modeling_swin.py Outdated Show resolved Hide resolved
tests/models/swin/test_modeling_swin.py Outdated Show resolved Hide resolved
src/transformers/models/swin/modeling_swin.py Outdated Show resolved Hide resolved
src/transformers/models/swin/modeling_swin.py Outdated Show resolved Hide resolved
the-neural-networker and others added 4 commits May 9, 2024 17:15
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@the-neural-networker
Copy link
Contributor Author

Hi @amyeroberts, thank you and I committed your suggestions! I will add the feature to the copies of Swin by running make fix-copies. I will have the update by the weekend.

@NielsRogge
Copy link
Contributor

Hi,

I think Swin Transformer already works on any resolution.

@the-neural-networker
Copy link
Contributor Author

the-neural-networker commented May 10, 2024

Hi @NielsRogge, yes it looks like dynamic resolution is supported by Swin using the maybe_pad method. Correct me if I am wrong, it looks like the pixel values are padded to achieve dynamic resolution where as with vit the position encodings are interpolated to achieve the same. What should be the next step @amyeroberts? Do we want to still add interpolate_pos_encoding in swin? I think I can still write the tests for dynamic resolution input.

@NielsRogge
Copy link
Contributor

@the-neural-networker I checked it, so currently any height and width are supported if divisible by 32. I assume you can continue with supporting interpolation of position embeddings, since that would allow any resolution (needs to be tested of course).

@the-neural-networker
Copy link
Contributor Author

the-neural-networker commented May 13, 2024

@amyeroberts, @NielsRogge I've been working on implementing the Swin transformer variants and writing their corresponding tests. I noticed that for MaskFormerSwin, there doesn't seem to be a pretrained checkpoint available, and it appears that an integration test hasn't been written for it yet (similar to DonutSwin).

Since MaskFormerSwin is primarily used as a backbone, I'm wondering about the best approach for writing its integration test. Currently, the integration tests for ViT and other models utilize pretrained checkpoints and their associated image processors to test the interpolation functionality.

Given the lack of a pretrained checkpoint for MaskFormerSwin, could you provide some guidance on how to properly test its integration?

See test_modeling_maskformer_swin.py.

@the-neural-networker the-neural-networker changed the title Enable dynamic resolution input for Swin Transformer Enable dynamic resolution input for Swin Transformer and variants May 13, 2024
@amyeroberts
Copy link
Collaborator

@the-neural-networker In this case, when there aren't existing checkpoints or integration tests, it's OK for you to skip adding tests for the interpolate method

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - thanks for adding this and improving the library's models!

Just some nits on defaulting to False.

Running make fix-copies should resolve the code consistency checks

src/transformers/models/swin/modeling_swin.py Outdated Show resolved Hide resolved
src/transformers/models/swin/modeling_swin.py Outdated Show resolved Hide resolved
src/transformers/models/swin/modeling_swin.py Outdated Show resolved Hide resolved
src/transformers/models/swinv2/modeling_swinv2.py Outdated Show resolved Hide resolved
src/transformers/models/swinv2/modeling_swinv2.py Outdated Show resolved Hide resolved
src/transformers/models/swinv2/modeling_swinv2.py Outdated Show resolved Hide resolved
src/transformers/models/swinv2/modeling_swinv2.py Outdated Show resolved Hide resolved
the-neural-networker and others added 7 commits May 15, 2024 14:30
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@the-neural-networker
Copy link
Contributor Author

the-neural-networker commented May 16, 2024

Hi @amyeroberts,

To clarify, should I:

  1. Add the interpolate method to DonutSwin and MaskFormerSwin, but omit the tests? Or...
  2. Leave DonutSwin and MaskFormerSwin unchanged?

@amyeroberts
Copy link
Collaborator

@the-neural-networker You can do either. I'd suggest 2, as the feature is unlikely to be used in the case of backbones. We can add it later if it's requested. If you find that you need to add it because of # Copied from comments, then you can add and not add integration tests for them

@the-neural-networker
Copy link
Contributor Author

Thank you for the clarification!

@the-neural-networker
Copy link
Contributor Author

It looks like interpolation needs to be added to DonutSwin and MaskFormerSwin because of #Copied from comments. So, I will be adding that, but omitting their tests.

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this feature and for being so thoughtful about testing!

Are there any other changes to be pushed? Otherwise I think we're good to merge 🤗

@the-neural-networker
Copy link
Contributor Author

Thank you for the thorough review and kind words! I think that is all the changes, so feel free to merge whenever you are ready!

@amyeroberts amyeroberts merged commit 481a957 into huggingface:main May 17, 2024
17 checks passed
itazap pushed a commit that referenced this pull request May 24, 2024
…0656)

* add interpolation of positional encoding support to swin

* add style changes

* use default image processor and make size a dictionary

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* remove logits testing

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Refactor image size validation logic when interpolation is disabled

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* remove asserts in modeling

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* add dynamic resolution input support to swinv2

* change size to ensure interpolation encoding path is triggered

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

* add dynamic resolution input to donut swin

* add dynamic resolution input to maskformer swin

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jun 11, 2024
…ggingface#30656)

* add interpolation of positional encoding support to swin

* add style changes

* use default image processor and make size a dictionary

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* remove logits testing

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Refactor image size validation logic when interpolation is disabled

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* remove asserts in modeling

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* add dynamic resolution input support to swinv2

* change size to ensure interpolation encoding path is triggered

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* set interpolate_pos_encoding default value to False

* add dynamic resolution input to donut swin

* add dynamic resolution input to maskformer swin

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants