
Community contribution: enable dynamic resolution input for more vision models. #30579

Open
10 of 11 tasks
amyeroberts opened this issue Apr 30, 2024 · 40 comments

Comments

@amyeroberts
Collaborator

amyeroberts commented Apr 30, 2024

Feature request

Some of our models interpolate their positional embeddings, enabling pretrained checkpoints to be used on inputs of different resolutions. For example, here in ViT.

Motivation

Let's add this to more models, to leverage existing checkpoints for new cases!

Your contribution

For anyone who would like to contribute: please comment on the issue to claim a model you'd like to work on, and share a link to the PR once it's open.

Each PR should:

  • Add an interpolate_pos_encoding method
  • Add a test showing the model can correctly interpolate an input image of a different size

A PR was previously opened to add this to the CLIP models; it is now inactive, but it is a useful reference for the changes to make: #27457

Once the PR is ready, you can ping me for review 🤗
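
For reference, here is a rough, standalone sketch of what a ViT-style interpolate_pos_encoding does (the free-function form and the names used are illustrative; in the models it is a method on the embeddings module):

import math
import torch
import torch.nn as nn

def interpolate_pos_encoding(position_embeddings: torch.Tensor, height: int, width: int, patch_size: int) -> torch.Tensor:
    """Resize pretrained position embeddings of shape (1, 1 + N, dim) to a new image size.

    The first token is assumed to be the [CLS] embedding and is kept as-is; the N patch
    embeddings are reshaped to a square grid and resized with bicubic interpolation.
    """
    class_pos_embed = position_embeddings[:, :1]
    patch_pos_embed = position_embeddings[:, 1:]
    dim = position_embeddings.shape[-1]
    grid_size = int(math.sqrt(patch_pos_embed.shape[1]))

    new_height = height // patch_size
    new_width = width // patch_size

    patch_pos_embed = patch_pos_embed.reshape(1, grid_size, grid_size, dim).permute(0, 3, 1, 2)
    patch_pos_embed = nn.functional.interpolate(
        patch_pos_embed, size=(new_height, new_width), mode="bicubic", align_corners=False
    )
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).reshape(1, new_height * new_width, dim)
    return torch.cat([class_pos_embed, patch_pos_embed], dim=1)

# Example: embeddings pretrained at 224x224 with 16x16 patches (a 14x14 grid), resized for a 336x336 input.
pos = torch.randn(1, 1 + 14 * 14, 768)
print(interpolate_pos_encoding(pos, height=336, width=336, patch_size=16).shape)  # torch.Size([1, 442, 768])

The in-model method additionally short-circuits when the input already matches the pretraining resolution, and the accompanying test typically runs the model on an image of a non-default size and checks the output shapes.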

@ashvinnihalani

I can take Clip and Blip2.

@NielsRogge
Contributor

A heads-up here: people have complained that interpolate_pos_encoding cannot be passed when using the Trainer to train models on higher-resolution images. I'm also not that happy that I named it interpolate_pos_encoding; it should have been called interpolate_position_embeddings.

@bhuvanmdev
Contributor

bhuvanmdev commented May 1, 2024

I can work on vit_mae and tvp.

@amyeroberts
Collaborator Author

Thanks for the heads up @NielsRogge!

A heads-up here: people have complained that interpolate_pos_encoding cannot be passed when using the Trainer to train models on higher-resolution images

OK, that's good to know. If many models have this, it's a good reason to spend some time figuring out a solution! The most important thing is that it works with a standard forward/backward pass; if that's working, we should be able to find a way to integrate it if it's a wanted feature.

I'm also not that happy that I named it interpolate_pos_encoding; it should have been called interpolate_position_embeddings.

Agreed, interpolate_position_embeddings would have been better originally. Now that interpolate_pos_encoding is implemented across the library, I'd say it's better to stick with it for consistency.

@NielsRogge
Contributor

Yes, so the problem is that the Trainer does not allow passing extra keyword arguments to a model's forward.

However, there's a workaround: https://discuss.huggingface.co/t/fine-tuning-vit-with-more-patches-higher-resolution/18731/4?u=nielsr
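
The gist of that workaround, as a minimal sketch (the subclass name is illustrative, and it assumes the model's forward already accepts interpolate_pos_encoding): subclass the model so the flag is always set, since the Trainer only forwards the keys present in the batch.

from transformers import ViTForImageClassification

class ViTForImageClassificationInterpolated(ViTForImageClassification):
    def forward(self, *args, **kwargs):
        # Always interpolate the position embeddings, since the Trainer will not
        # pass extra keyword arguments through to forward.
        kwargs.setdefault("interpolate_pos_encoding", True)
        return super().forward(*args, **kwargs)

model = ViTForImageClassificationInterpolated.from_pretrained("google/vit-base-patch16-224")
# `model` can now be handed to the Trainer and fine-tuned on higher-resolution images.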

bhuvanmdev added a commit to bhuvanmdev/transformers that referenced this issue May 2, 2024
…lip vision model

This commit introduces the `interpolate_pos_encoding` function to the `altclip` classes. It allows for high resolution images to be processed without image resizing.

partially solves Issue huggingface#30579
bhuvanmdev added a commit to bhuvanmdev/transformers that referenced this issue May 2, 2024
`interpolate_pos_encoding` function to the `altclip` vision models. It allows high-resolution images to be fed into the model for fine-tuning, irrespective of the pre-trained image configuration.

issue huggingface#30579
@the-neural-networker
Contributor

I can work on deit!

@jla524
Contributor

jla524 commented May 3, 2024

I'd like to work on vivit

@faiez22222

I can take Clip and Blip2.

Hi @ashvinnihalani,
I am new to open source. Can you help me a little to get started with this task?

@davidgxue
Contributor

davidgxue commented May 3, 2024

I can work on chinese_clip. Will keep the team posted in the next few days. If I get more free time and there are remaining ones by then, happy to help out on additional tasks.

@g1y5x3
Contributor

g1y5x3 commented May 3, 2024

Working on detr; it's a bit tricky. Will explain in the PR.

@davidgxue
Contributor

Actually, I can take bridgetower as well. It will come in as a separate PR, though, and shouldn't be more complicated than chinese_clip.
To recap: I will work on both bridgetower and chinese_clip.

@nileshkokane01
Contributor

nileshkokane01 commented May 4, 2024

@amyeroberts ,

How do you manage this with make fix-copies, since most of the models are copied from CLIP and we end up changing models that others have claimed? I did change Git, but it is copied from CLIP, and that in turn triggers cascading changes.

Or should we avoid `make fix-copies` altogether before sending a PR?

@the-neural-networker
Contributor

I will work on Swin, since DeiT is already implemented.

@yMayanand

I will work on owlvit.

@amyeroberts
Collaborator Author

@nileshkokane01 This is a good point - I'll update the list of models to indicate which models are "grouped" together. In the case of e.g. the CLIP family, there should be just one PR opened to add the feature to CLIP and the models which are copied from it. The steps would be:

  • Make the changes for CLIP
  • Run make fix-copies to propagate the changes to models which copy from CLIP (the marker it keys on is illustrated below)
  • Update those models so the feature is properly applied to all of them
  • Add tests for all the affected models
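
For illustration, the kind of marker make fix-copies keys on looks like this (the class below is only an example); editing the CLIP source and re-running the command regenerates every module that carries such a comment:

import torch.nn as nn

# Copied from transformers.models.clip.modeling_clip.CLIPVisionEmbeddings with CLIP->ChineseCLIP
class ChineseCLIPVisionEmbeddings(nn.Module):
    # `make fix-copies` rewrites this class body from the CLIP implementation,
    # so the interpolation logic only needs to be written once, in CLIP.
    ...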

@davidgxue
Contributor

davidgxue commented May 7, 2024

@nileshkokane01 @amyeroberts In that case, I will refrain from working on chinese_clip and bridgetower, since both have # Copied from transformers.models.clip.modeling_clip.CLIPVisionEmbeddings with CLIP in the comments. I think Kosmos 2 may also be copied from CLIP. Most likely a fair number of models on the list will inherit from CLIP (just a heads-up to other folks).

Update: oh nice, thank you Amy for updating the description to group them.

@davidgxue
Contributor

davidgxue commented May 7, 2024

I can take siglip. I think some functions are still copied from CLIP, but from a quick skim they don't seem to be related to the interpolate-position-encoding code.

@zafstojano
Contributor

@amyeroberts Doesn't idefics2 already handle this?

class Idefics2VisionEmbeddings(nn.Module):
    """
    This is a modified version of `siglip.modelign_siglip.SiglipVisionEmbeddings` to enable images of variable
    resolution.
    The modifications are adapted from [Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution](https://arxiv.org/abs/2307.06304)
    which allows treating images in their native aspect ratio and without the need to resize them to the same
    fixed size. In particular, we start from the original pre-trained SigLIP model
    (which uses images of fixed-size square images) and adapt it by training on images of variable resolutions.
    """

For example, the following sample script:

import torch
import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

url = "https://upload.wikimedia.org/wikipedia/commons/c/cc/ESC_large_ISS022_ISS022-E-11387-edit_01.JPG"
images = [Image.open(requests.get(url, stream=True).raw)]
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's on the image?"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
# Instead of the default 980, allow the largest edge to be 1500
processor.image_processor.size["longest_edge"] = 1500 

model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b").to(device)

text = processor.apply_chat_template(messages)
inputs = processor(text=text, images=images, return_tensors="pt", padding=True)
for k, v in inputs.items():
    inputs[k] = v.to(device)

print("Input image shape:", inputs["pixel_values"].shape)

with torch.no_grad():
    out = model(**inputs)

print("Finished!")

Executes without errors and prints the following:

Loading checkpoint shards: 100%|████████████████████████| 7/7 [00:03<00:00,  2.26it/s]
Input image shape: torch.Size([1, 1, 3, 994, 1500])
Finished!

davidgxue mentioned this issue May 7, 2024
@bhuvanmdev
Contributor

Since all CLIP-like models can just borrow the changes made to the CLIP model, I will take tvp instead of altclip.

@amyeroberts
Collaborator Author

@zafstojano Indeed! That's what I get for doing a quick grep and not double checking. Thanks for showing an example to verify. I'll take it off the list

@zafstojano
Contributor

Opened a PR (#30722) addressing this issue for the BLIP family of models (BLIP, BLIP2, InstructBLIP).

@kyrajeep

kyrajeep commented May 20, 2024

@amyeroberts I would like to work on DETR. Is anyone working on it?

@g1y5x3
Contributor

g1y5x3 commented May 20, 2024

@amyeroberts I would like to work on DETR. Is anyone working on it?

I'm almost done. Was busy with work in the past 2 weeks.

@M-Ali-ML

I'll be working on grounding_dino and hopefully I will have a PR soon.

@amyeroberts
Collaborator Author

@MightyStud Thanks for picking a model and working to add this feature! After reviewing #30921, I realised that this isn't something we can add for models with backbones, which includes grounding DINO and DETR-related models. I've updated the list to reflect this.

@M-Ali-ML

@amyeroberts Aha, thanks for letting me know. I'd like to work on swin2sr then, since I already allocated time this week.

@OmarManzoor
Contributor

OmarManzoor commented May 23, 2024

Hi @amyeroberts Can I try out beit or data2vec?

@amyeroberts
Collaborator Author

@OmarManzoor Certainly!

@kishore-s-15
Contributor

@amyeroberts Is there any model that I can work on in this task?

@amyeroberts
Collaborator Author

@kishore-s-15 There is currently no open PR for deit

@kishore-s-15
Contributor

kishore-s-15 commented May 28, 2024

Thanks, @amyeroberts, I would love to work on it. Could you assign it to me?

@p-kris10
Contributor

@amyeroberts I have opened a PR (#31131) for deit.

@jacksoncamp42

@amyeroberts Are there any models I can work on for this task?

@amyeroberts
Collaborator Author

@jacksoncamp42 All models currently have open PRs. If you're interested in adding features to vision models, another way to contribute would be enabling device_map="auto": #29786
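
For context, once a model supports it, a checkpoint can be loaded with its weights dispatched automatically across the available devices, roughly as below (the checkpoint name is only an example, and accelerate needs to be installed):

from transformers import AutoModelForImageClassification

# Shard the checkpoint across the available GPUs (falling back to CPU) at load time.
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", device_map="auto"
)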

@jacksoncamp42

@amyeroberts Thanks for the suggestion. Unfortunately, I currently don't have access to a multi-GPU environment.

Is there another area or feature that I can contribute to without needing a multi-GPU setup?

@amyeroberts
Collaborator Author

@jacksoncamp42 Anyone in the community is welcome to tackle any issue within the library. For people contributing for the first time, we suggest looking for issues with the Good first issue or Good second issue tags.

@manuelsh
Contributor

manuelsh commented Sep 25, 2024

CLIP family models have been tackled (and merged) here: #32600

@kshitij0-0

@amyeroberts, can you please have a look at PR #34268, which adds interpolation to the owlvit models?

Thanks
