-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable dynamic resolution input for Swin Transformer and variants #30656
Enable dynamic resolution input for Swin Transformer and variants #30656
Conversation
Any suggestions are more than welcome! This is my first open-source contribution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great - thanks for adding this!
For the swin implementation, just a couple of small comments.
For the quality checks, could you:
- Run
make fix-copies
- Add equivalent tests to the models which are updated with the fix-copies run and add any necessary changes to their modeling files so the feature is enabled there too?
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Hi @amyeroberts, thank you and I committed your suggestions! I will add the feature to the copies of Swin by running |
Hi, I think Swin Transformer already works on any resolution. |
Hi @NielsRogge, yes it looks like dynamic resolution is supported by Swin using the |
@the-neural-networker I checked it, so currently any height and width are supported if divisible by 32. I assume you can continue with supporting interpolation of position embeddings, since that would allow any resolution (needs to be tested of course). |
@amyeroberts, @NielsRogge I've been working on implementing the Swin transformer variants and writing their corresponding tests. I noticed that for Since Given the lack of a pretrained checkpoint for |
@the-neural-networker In this case, when there aren't existing checkpoints or integration tests, it's OK for you to skip adding tests for the interpolate method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great - thanks for adding this and improving the library's models!
Just some nits on defaulting to False
.
Running make fix-copies
should resolve the code consistency checks
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Hi @amyeroberts, To clarify, should I:
|
@the-neural-networker You can do either. I'd suggest 2, as the feature is unlikely to be used in the case of backbones. We can add it later if it's requested. If you find that you need to add it because of |
Thank you for the clarification! |
It looks like interpolation needs to be added to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this feature and for being so thoughtful about testing!
Are there any other changes to be pushed? Otherwise I think we're good to merge 🤗
Thank you for the thorough review and kind words! I think that is all the changes, so feel free to merge whenever you are ready! |
…0656) * add interpolation of positional encoding support to swin * add style changes * use default image processor and make size a dictionary Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * remove logits testing Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Refactor image size validation logic when interpolation is disabled Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * remove asserts in modeling Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * add dynamic resolution input support to swinv2 * change size to ensure interpolation encoding path is triggered * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False * add dynamic resolution input to donut swin * add dynamic resolution input to maskformer swin --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
…ggingface#30656) * add interpolation of positional encoding support to swin * add style changes * use default image processor and make size a dictionary Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * remove logits testing Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Refactor image size validation logic when interpolation is disabled Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * remove asserts in modeling Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * add dynamic resolution input support to swinv2 * change size to ensure interpolation encoding path is triggered * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * set interpolate_pos_encoding default value to False * add dynamic resolution input to donut swin * add dynamic resolution input to maskformer swin --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
This PR adds the
interpolate_pos_encoding
feature to the Swin Transformer model, allowing it to handle input images of different resolutions while leveraging existing pre-trained checkpoints.Addresses #30579.
Changes
The following changes have been made to enable dynamic resolution input for the Swin Transformer model:
Added the
interpolate_pos_encoding
method to theSwinModel
class. This method takes the pre-trained position embeddings and the target height and width as inputs and returns the interpolated position embeddings.Modified the
forward
method of theSwinModel
class to accept aninterpolate_pos_encoding
argument. When set toTrue
, the model will interpolate the position embeddings based on the input image size.Added a test case in
test_modeling_swin.py
to verify that the model can correctly interpolate position embeddings for input images of different sizes.Who can review?
@amyeroberts