Community contribution: enable dynamic resolution input for more vision models. #30579

amyeroberts · 2024-04-30T17:00:10Z

ashvinnihalani · 2024-04-30T19:39:17Z

I can take Clip and Blip2.

NielsRogge · 2024-04-30T19:52:29Z

Some heads up here; people have complained about the fact that interpolate_pos_encoding cannot be passed when using the Trainer to train models on higher-resolution images. I also am not that happy I named it interpolate_pos_encoding, should have been called interpolate_position_embeddings.

bhuvanmdev · 2024-05-01T03:21:15Z

i can work on vit_mae and tvp

amyeroberts · 2024-05-01T09:05:38Z

Thanks for the heads up @NielsRogge!

Some heads up here; people have complained about the fact that interpolate_pos_encoding cannot be passed when using the Trainer to train models on higher-resolution images

OK, that's good to know. If many models have this it's a good reason to spend some time to figure out a solution! The most important thing is that it will work with a standard forward / backwards pass - if that's working we should be able to find a way to integrate if it's a wanted feature.

I also am not that happy I named it interpolate_pos_encoding, should have been called interpolate_position_embeddings.

Agreed interpolate_position_embeddings would have been better originally. Now interpolate_pos_encoding is implemented across the library I'd say it's better to stick with it to be consistent.

NielsRogge · 2024-05-01T16:11:51Z

Yes so the problem is that the Trainer does not allow to pass any keyword arguments to the forward of a model.

However, there's a workaround: https://discuss.huggingface.co/t/fine-tuning-vit-with-more-patches-higher-resolution/18731/4?u=nielsr

…lip vision model This commit introduces the `interpolate_pos_encoding` function to the `altclip` classes. It allows for high resolution images to be processed without image resizing. partially solves Issue huggingface#30579

`interpolate_pos_encoding` function to the `altclip` vision models. It allows for high resolution images into the model for finetunning irrespective of the pre-trained image configuration. issue huggingface#30579

the-neural-networker · 2024-05-03T01:53:32Z

I can work on deit!

jla524 · 2024-05-03T06:16:34Z

I'd like to work on vivit

faiez22222 · 2024-05-03T18:06:22Z

I can take Clip and Blip2.

Hi ashavinni
i am new to open source , can you help me little to get started with this task

davidgxue · 2024-05-03T19:06:06Z

I can work on chinese_clip. Will keep the team posted in the next few days. If I get more free time and there are remaining ones by then, happy to help out on additional tasks.

g1y5x3 · 2024-05-03T21:11:35Z

Working on detr, a bit tricky. Will explain in the PR.

davidgxue · 2024-05-03T23:15:49Z

Actually, I can also take bridgetower as well. They will come in as separate PRs though. Shouldn't be more complicated than chinese_clip.
So recap: I will work on both bridgetower and chinese_clip.

nileshkokane01 · 2024-05-04T07:04:39Z

@amyeroberts ,

How you manage this with make fix-copies , as most of the models are copied from CLIP and eventually we end up changing models that others have claimed for . I did change Git but that is copied from CLIP and that inturn triggers cascading changes.

Or avoid `make fix-copies' altogether before sending a PR?

the-neural-networker · 2024-05-05T04:04:30Z

I will work on Swin, since DeiT is already implemented.

yMayanand · 2024-05-05T12:06:13Z

I will work on owlvit.

amyeroberts · 2024-05-07T15:20:39Z

@nileshkokane01 This is a good point - I'll update the list of models to indicates models which are "grouped" together. In the case of e.g. the CLIP family, there should just be one PR opened for adding the feature to CLIP and the models which are copied from it. The steps would be:

Make the changes for CLIP
Run make fix-copies to propogate to models which copy from CLIP
Update those models so feature is properly applied to all the models
Add tests for all the affected models

davidgxue · 2024-05-07T15:41:24Z

@nileshkokane01 @amyeroberts In that case, I will refrain from working on chinese_clip and bridgetower since both have # Copied from transformers.models.clip.modeling_clip.CLIPVisionEmbeddings with CLIP in the comments. I think Kosmos 2 may also be copied from CLIP. Most likely a fair amount on the list will be inheriting from CLIP (just as a heads up to other folks)

Update: oh nice thank you Amy for updating the description to group them

davidgxue · 2024-05-07T15:48:28Z

I can take siglip. I think some functions are still copied from CLIP but just skimming it, doesn't seem like they will be related to interpolate position encoding code

zafstojano · 2024-05-07T17:07:46Z

@amyeroberts Doesn't idefics2 already handle this?

transformers/src/transformers/models/idefics2/modeling_idefics2.py

Lines 139 to 149 in cf7bed9

    
           class Idefics2VisionEmbeddings(nn.Module): 
        
               """ 
        
               This is a modified version of `siglip.modelign_siglip.SiglipVisionEmbeddings` to enable images of variable 
        
               resolution. 
        
               The modifications are adapted from [Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution](https://arxiv.org/abs/2307.06304) 
        
               which allows treating images in their native aspect ratio and without the need to resize them to the same 
        
               fixed size. In particular, we start from the original pre-trained SigLIP model 
        
               (which uses images of fixed-size square images) and adapt it by training on images of variable resolutions. 
        
               """

For example, the following sample script:

import torch
import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

url = "https://upload.wikimedia.org/wikipedia/commons/c/cc/ESC_large_ISS022_ISS022-E-11387-edit_01.JPG"
images = [Image.open(requests.get(url, stream=True).raw)]
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's on the image?"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
# Instead of the default 980, allow the largest edge to be 1500
processor.image_processor.size["longest_edge"] = 1500 

model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b").to(device)

text = processor.apply_chat_template(messages)
inputs = processor(text=text, images=images, return_tensors="pt", padding=True)
for k, v in inputs.items():
    inputs[k] = v.to(device)

print("Input image shape:", inputs["pixel_values"].shape)

with torch.no_grad():
    out = model(**inputs)

print("Finished!")

Executes without errors and prints the following:

Loading checkpoint shards: 100%|████████████████████████| 7/7 [00:03<00:00,  2.26it/s]
Input image shape: torch.Size([1, 1, 3, 994, 1500])
Finished!

bhuvanmdev · 2024-05-08T03:33:32Z

Since all clip like models can just borrow changes made to clip model, I will take tvp instead of altclip.

amyeroberts · 2024-05-08T09:36:06Z

@zafstojano Indeed! That's what I get for doing a quick grep and not double checking. Thanks for showing an example to verify. I'll take it off the list

zafstojano · 2024-05-08T20:59:27Z

Opened a PR (#30722) addressing this issue for the BLIP family of models (BLIP, BLIP2, InstructBLIP).

kyrajeep · 2024-05-20T13:56:51Z

@amyeroberts I would like to work on DETR. Is anyone working on it?

g1y5x3 · 2024-05-20T14:29:19Z

@amyeroberts I would like to work on DETR. Is anyone working on it?

I'm almost done. Was busy with work in the past 2 weeks.

M-Ali-ML · 2024-05-21T14:32:44Z

I'll be working on grounding_dino and hopefuly I will have a PR soon.

amyeroberts · 2024-05-21T19:48:51Z

@MightyStud Thanks for picking a model and working to add this feature! After reviewing #30921, I realised that this isn't something we can add for models with backbones, which includes grounding DINO and DETR related models. I've updated the list to reflect this.

M-Ali-ML · 2024-05-21T21:48:48Z

@amyeroberts Aha, thanks for letting me know, I'd like to work on swin2sr then since I already allocated time this week.

OmarManzoor · 2024-05-23T11:45:23Z

Hi @amyeroberts Can I try out beit or data2vec?

amyeroberts · 2024-05-23T12:44:22Z

@OmarManzoor Certainly!

kishore-s-15 · 2024-05-28T04:31:00Z

@amyeroberts Is there any model that I can work on in this task?

amyeroberts · 2024-05-28T10:51:17Z

@kishore-s-15 There is currently no open PR for deit

kishore-s-15 · 2024-05-28T22:07:52Z

Thanks, @amyeroberts, I would love to work on it. Could you assign it for me?

p-kris10 · 2024-05-30T14:42:32Z

@amyeroberts have opened a PR(#31131) for deit

jacksoncamp42 · 2024-06-02T23:39:17Z

@amyeroberts Are there any models I can work on for this task?

amyeroberts · 2024-06-03T11:49:10Z

@jacksoncamp42 All models currently have open PRs. If you're interested in adding features to vision models, another way to contribute would be adding enabling device_map=auto: #29786

jacksoncamp42 · 2024-06-03T17:31:19Z

@amyeroberts Thanks for the suggestion. Unfortunately, I currently don't have access to a multi-GPU environment.

Is there another area or feature that I can contribute to without needing a multi-GPU setup?

amyeroberts · 2024-06-03T20:03:08Z

@jacksoncamp42 Anyone in the community is welcome to tackle any issue within the library. For people who are contributing for the first time, we suggest looking for issues with the Good first issue or Good second issue tags

manuelsh · 2024-09-25T20:34:15Z

CLIP family models have been tackled (and merged) here: #32600

kshitij0-0 · 2024-10-20T13:02:23Z

@amyeroberts , can you please have a look at the PR #34268 which adds interpolation in the owlvit models.

Thanks

amyeroberts added Good First Issue Vision labels Apr 30, 2024

jla524 mentioned this issue May 3, 2024

Enable dynamic resolution for vivit #30630

Merged

bhuvanmdev mentioned this issue May 3, 2024

Interpolate pos encode for altclip #30635

Closed

nileshkokane01 mentioned this issue May 4, 2024

DeiT, CLIP and Git interpolation added #30649

Closed

5 tasks

the-neural-networker mentioned this issue May 5, 2024

Enable dynamic resolution input for Swin Transformer and variants #30656

Merged

bhuvanmdev mentioned this issue May 5, 2024

Interpolate pos encode vitmae #30657

Closed

davidgxue mentioned this issue May 7, 2024

Add SigLIP #26522

Merged

8 tasks

zafstojano mentioned this issue May 8, 2024

Blip dynamic input resolution #30722

Merged

5 tasks

bhuvanmdev mentioned this issue May 9, 2024

added interpolation for vitmae model in pytorch as well as tf. #30732

Merged

nileshkokane01 mentioned this issue May 13, 2024

fixes clip interpolate #30783

Closed

5 tasks

bhuvanmdev mentioned this issue May 16, 2024

interpolation added for TVP. #30863

Merged

g1y5x3 mentioned this issue May 20, 2024

add interpolate_pos_encoding for Detr and test #30921

Closed

5 tasks

g1y5x3 mentioned this issue May 23, 2024

Perceiver interpolate position embedding #30979

Merged

5 tasks

M-Ali-ML mentioned this issue May 25, 2024

Add interpolation of positional embedding to swin2sr #31024

Closed

5 tasks

OmarManzoor mentioned this issue May 27, 2024

Enable dynamic resolution input for Beit #31053

Merged

4 tasks

p-kris10 mentioned this issue May 30, 2024

Add dynamic resolution input/interpolate position embedding to deit #31131

Merged

5 tasks

kshitij0-0 mentioned this issue Oct 20, 2024

added interpolation for owlvit & owlv2. #34268

Open

manuelsh mentioned this issue Oct 25, 2024

ClipSeg broken #34415

Closed

4 tasks

bastrob mentioned this issue Nov 17, 2024

owlvit/2 dynamic input resolution. #34764

Merged

4 tasks

Community contribution: enable dynamic resolution input for more vision models. #30579

Community contribution: enable dynamic resolution input for more vision models. #30579

Comments

amyeroberts commented Apr 30, 2024 • edited by qubvel Loading

Feature request

Motivation

Your contribution

ashvinnihalani commented Apr 30, 2024

NielsRogge commented Apr 30, 2024

bhuvanmdev commented May 1, 2024 • edited Loading

amyeroberts commented May 1, 2024

NielsRogge commented May 1, 2024

the-neural-networker commented May 3, 2024

jla524 commented May 3, 2024

faiez22222 commented May 3, 2024

davidgxue commented May 3, 2024 • edited Loading

g1y5x3 commented May 3, 2024

davidgxue commented May 3, 2024

nileshkokane01 commented May 4, 2024 • edited Loading

the-neural-networker commented May 5, 2024

yMayanand commented May 5, 2024

amyeroberts commented May 7, 2024

davidgxue commented May 7, 2024 • edited Loading

davidgxue commented May 7, 2024 • edited Loading

zafstojano commented May 7, 2024

bhuvanmdev commented May 8, 2024

amyeroberts commented May 8, 2024

zafstojano commented May 8, 2024

kyrajeep commented May 20, 2024 • edited Loading

g1y5x3 commented May 20, 2024

M-Ali-ML commented May 21, 2024

amyeroberts commented May 21, 2024

M-Ali-ML commented May 21, 2024

OmarManzoor commented May 23, 2024 • edited Loading

amyeroberts commented May 23, 2024

kishore-s-15 commented May 28, 2024

amyeroberts commented May 28, 2024

kishore-s-15 commented May 28, 2024 • edited Loading

p-kris10 commented May 30, 2024

jacksoncamp42 commented Jun 2, 2024

amyeroberts commented Jun 3, 2024

jacksoncamp42 commented Jun 3, 2024

amyeroberts commented Jun 3, 2024

manuelsh commented Sep 25, 2024 • edited Loading

kshitij0-0 commented Oct 20, 2024

amyeroberts commented Apr 30, 2024 •

edited by qubvel

Loading

bhuvanmdev commented May 1, 2024 •

edited

Loading

davidgxue commented May 3, 2024 •

edited

Loading

nileshkokane01 commented May 4, 2024 •

edited

Loading

davidgxue commented May 7, 2024 •

edited

Loading

davidgxue commented May 7, 2024 •

edited

Loading

kyrajeep commented May 20, 2024 •

edited

Loading

OmarManzoor commented May 23, 2024 •

edited

Loading

kishore-s-15 commented May 28, 2024 •

edited

Loading

manuelsh commented Sep 25, 2024 •

edited

Loading