ClipSeg broken #34415
Note that a temporary workaround for this error is just manually passing the argument used in the tests (see transformers/tests/models/clipseg/test_modeling_clipseg.py, lines 833 to 840 at f73f5e6):

Try `model(**processed, interpolate_pos_encoding=True)`
That does make it work. Note that that is not a solution, as (A) an argument shouldn't be required just to work at all here, and (B) it is incompatible with older versions of transformers, so it can't even work as a workaround code patch.
A) The argument does appear to be required in this case because of the check in transformers/src/transformers/models/clipseg/modeling_clipseg.py, lines 193 to 198 (commit 53fad64), in some cases. Imo these changes might have benefited from a deprecation warning, and documentation should have been updated, but I'll leave it to maintainers to make further comment.
I think you missed the point: the input image is not 352. It's 512x512 in my above example, but could be anything. ClipSeg's own code within transformers is resizing it to 352, then complaining that it's the wrong size.
Thanks for reporting, I contributed CLIPSeg to the Transformers library some years ago. There was no `interpolate_pos_encoding` argument back then, and I'm personally not a fan of just blindly adding this argument to all models. Edit: yes, they just introduced a breaking change.
Anyway, I've opened a PR to revert it: #34419
All vision models, including the CLIP family models, have been modified (or are in the process of being modified) to use the same interpolation function, gated by the `interpolate_pos_encoding` argument. It is true that ClipSeg already had a function doing that, which works in the same way as the current function (both use bicubic interpolation). But this has not been "blindly" substituted: the benefit of the new function is a single, consistent implementation shared across vision models.

I believe the solution is not to revert back to the previous function, but possibly to remove the need to add `interpolate_pos_encoding=True` for this model. @amyeroberts, what do you think?
If the behavior is identical to the previous one with the arg enabled, it should probably be enabled by default? Again, I need to reiterate: running clipseg as intended will always produce different shapes that require interpolation. It is explicitly erroneous to have that argument disabled.
@mcmonkey4eva why is it "explicitly erroneous to have the argument disabled"? Note that the positional encoding interpolation introduced affects only the input image, not the output segmentation of ClipSeg.
I am confused why I have to keep repeating this: if you run the code as intended, it errors. The code does not work if you do not have this interpolation enabled. The size difference that it needs to interpolate between is produced by clipseg itself, not by the caller code. Again, the input image can be any size you want; the HF Transformers ClipSeg code is what resizes it to 352, and then complains that the 352 it output is not the 224 it expects.
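A quick way to see the two sizes being discussed, assuming the `CIDAS/clipseg-rd64-refined` checkpoint used elsewhere in this thread (the printed values are what that checkpoint is expected to report):

```python
# Sketch: the processor resizes every input to 352x352, while the vision
# backbone's config still declares 224x224, which is where the size-check
# error comes from unless the positional encoding is interpolated.
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

print(processor.image_processor.size)         # expected: {'height': 352, 'width': 352}
print(model.config.vision_config.image_size)  # expected: 224
```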
@mcmonkey4eva You appear to be confused by the currently incorrect documentation; indeed, running the code from the incorrect documentation produces an error, and it is unfortunate the documentation was not updated to include `interpolate_pos_encoding=True`.

Please refer to the comments above for a further explanation of why this change was made and the proposed solutions, i.e. updating documentation, adding a deprecation warning, and setting the default to `True` for this model.

In the meantime, you can easily resolve this in your integration with one of the following options.

Option 1, a version check against `transformers>=4.46.0`:
```python
import torch
import numpy as np
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
import importlib.metadata
import packaging.version  # import the submodule explicitly; `import packaging` alone is not enough

arr = np.ones((512, 512, 3), dtype=np.uint8) * 128
arr[240:272, 240:272] = 0
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")
kwargs = (
    {"interpolate_pos_encoding": True}
    if packaging.version.parse(importlib.metadata.version("transformers"))
    >= packaging.version.Version("4.46.0")
    else {}
)
with torch.no_grad():
    processed = processor(
        text="the black spot", images=arr, return_tensors="pt", padding=True
    )
    print(f"{processed.pixel_values.shape}")
    mask = model(**processed, **kwargs)[0]
```
Option 2, gating on the model's forward signature instead of the installed version:

```python
import torch
import numpy as np
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
import inspect

arr = np.ones((512, 512, 3), dtype=np.uint8) * 128
arr[240:272, 240:272] = 0
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")
kwargs = (
    {"interpolate_pos_encoding": True}
    if "interpolate_pos_encoding" in inspect.signature(model.forward).parameters
    else {}
)
with torch.no_grad():
    processed = processor(
        text="the black spot", images=arr, return_tensors="pt", padding=True
    )
    print(f"{processed.pixel_values.shape}")
    mask = model(**processed, **kwargs)[0]
```
Hi, @mcmonkey4eva is not confused because of that, as this has been added to the documentation too; please check https://huggingface.co/docs/transformers/v4.46.0/en/model_doc/clipseg#transformers.CLIPSegModel, where the `interpolate_pos_encoding` parameter is documented.

@mcmonkey4eva, as mentioned before, in the new version you need to add `interpolate_pos_encoding=True`. You can also refer to here for an explanation of the changes.
Please by god read what I wrote above repeatedly. This is an incredibly frustrating attempt at communication.
Let me just lay this out nice and neatly so that everyone can get on the same page. (I'll use "pos embed" to refer to the positional encoding/embedding just for brevity's sake)
This was the status quo for quite some time.
This is a breaking user-facing API change that is, IMO, entirely unjustified. Merely updating the docs to say "you have to pass `interpolate_pos_encoding=True` now" doesn't undo the breakage of existing code.

The question is not "how do I get my code to work now"; @mcmonkey4eva certainly doesn't need help with that given the codebases he maintains. The question is "Can we fix this unnecessary breakage of existing code and unintuitive change in default behaviour?"

The simple and straightforward solution here would be to just change the default of `interpolate_pos_encoding` to `True` for this model. A slightly more complicated one would be to make the default value of this arg a model configuration property, with the default overridden in the model class.
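A minimal sketch of what that config-property idea could look like; the names below are invented for illustration and are not actual Transformers code:

```python
# Hypothetical sketch: make the default for interpolate_pos_encoding a
# per-model config property instead of a hard-coded False in the signature.
class VisionConfigSketch:
    def __init__(self, interpolate_pos_encoding_default: bool = False):
        # A model like ClipSeg, whose processor always resizes to 352,
        # would set this to True in its own config class.
        self.interpolate_pos_encoding_default = interpolate_pos_encoding_default


def forward_sketch(pixel_values, config: VisionConfigSketch, interpolate_pos_encoding=None):
    # None means "use the model's own default"; an explicit True/False
    # passed by the caller still wins.
    if interpolate_pos_encoding is None:
        interpolate_pos_encoding = config.interpolate_pos_encoding_default
    ...
```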
Thanks for the thorough recap of the fundamentals, @neggles! In that vein, I'll make some further points:

- That PR added the function to various ViT models and replaced it only in ClipSeg. This was part of a greater project #30579 to enable dynamic resolution input for more vision models. Other benefits are explained here, namely TorchScript compatibility.
- Yes, as we've said, it probably should have been accompanied by a deprecation warning.
- The burden of workarounds is simply part and parcel of being a developer; Transformers itself has several. If you'd like the codebase to work in the meantime, please refer to the solutions here.
- Yes, please refer to the proposed solutions here and here.
- Indeed, this is one of the proposed solutions.
Good take @neggles, I agree with you! Thanks also @manuelsh for clarifying the background. I believe making `interpolate_pos_encoding=True` the default for this model is the right call.

However @manuelsh, I saw the logits were updated in the integration test; that should not be the case, logits should remain the same to ensure backwards compatibility when passing `interpolate_pos_encoding=True`.
@hlky Fair point! I was on my phone so I didn't have the best view of that PR, but my comments still very much apply to the combination of #32600 and #30579 taken together 😝 I'm quite aware of the benefits of the change - and entirely in favor of the overall change, FWIW - but there appeared to be some miscommunication W.R.T. @mcmonkey4eva's purpose with opening this PR, so it seemed like a good idea to try and clear that up.
Well yeah, but a lot of downstream users of this library are data scientists, not developers, and just because Python ML library dependency management is already a dumpster fire doesn't mean we should make it worse without a good reason! One of the stated goals of Transformers is backwards compatibility.

@NielsRogge Thanks! I agree. Seems like the lowest-effort, lowest-impact, most-obvious choice, and as a bonus it won't cause any problems for anyone who does happen to have implemented a workaround in the meantime.

FWIW, it looks like the actual logit change in #32600 was fairly insignificant:

```diff
         expected_masks_slice = torch.tensor(
-            [[-7.4613, -7.4785, -7.3628], [-7.3268, -7.0899, -7.1333], [-6.9838, -6.7900, -6.8913]]
+            [[-7.4613, -7.4785, -7.3627], [-7.3268, -7.0898, -7.1333], [-6.9838, -6.7900, -6.8913]]
         ).to(torch_device)
```

Probably just down to rounding changes; chasing that down is maybe not entirely worth the effort 🤷‍♀️
@mcmonkey4eva, I understand the frustration with the breaking change in ClipSeg requiring an extra argument. I'm addressing this and appreciate everyone's pragmatic solution to default `interpolate_pos_encoding` to `True`.

What's unclear is whether we're keeping the new interpolation function or reverting to the previous one.
@manuelsh we're usually pretty strict on logits matching, up to atol=1e-4 to 1e-6. So perhaps we could modify the method to keep the old behaviour of interpolation (which matches the original one).
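For reference, a rough check of how far apart the two expected slices quoted in the diff above actually are (values taken straight from that diff):

```python
# Compares the old and new expected logit slices from the integration test.
import torch

old = torch.tensor(
    [[-7.4613, -7.4785, -7.3628], [-7.3268, -7.0899, -7.1333], [-6.9838, -6.7900, -6.8913]]
)
new = torch.tensor(
    [[-7.4613, -7.4785, -7.3627], [-7.3268, -7.0898, -7.1333], [-6.9838, -6.7900, -6.8913]]
)

print((old - new).abs().max())               # ~1e-4, right at the edge of the usual tolerance
print(torch.allclose(old, new, atol=1e-4))   # True
print(torch.allclose(old, new, atol=1e-6))   # False
```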
Hey all! 🤗
I completely agree with you, and am sorry that this escaped our reviews; we'll patch this ASAP. @NielsRogge I think logits depend on the hardware you are using (the person making the PR might not have the same hardware), and the difference is acceptable!
Hi @manuelsh, the issue is fixed, apologies for my comment earlier, and thanks for the great explainer.
No worries @NielsRogge, I am glad we landed on a good solution.
I have 4.46.2 and this is still broken, is that to be expected? Which version has the fix? :)
I have this error in Pinokio ComfyUI. I understand I need to downgrade to transformers==4.45.0. I did that with Anaconda and `pip install transformers==4.45.0`, but should it be installed in E:\pinokio\api\comfy.git\app, or in which folder? I also didn't find any test_modeling_clipseg.py, only a modeling_clipseg.py.

In which folder do I have to run `pip install transformers==4.45.0` in cmd or Anaconda? (I use Pinokio ComfyUI)
System Info

Irrelevant; happens on any env with updated transformers. Replicated myself on Transformers v4.46.0 (current release on pip at time of writing).

Who can help?

@amyeroberts @manuelsh
Reproduction
pip install transformers==4.46.0 (latest currently).

Here's a fully contained python file if you want a lazy replication:
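(The attached script itself is not captured here; the following is a reconstruction based on the workaround snippets earlier in the thread, with the `interpolate_pos_encoding` kwarg omitted so that it reproduces the failure on 4.46.0.)

```python
# Reconstructed repro sketch: 512x512 gray image with a black square,
# segmented with CLIPSeg. On transformers 4.46.0 the forward call raises
# "Input image size (352*352) doesn't match model (224*224)."
import torch
import numpy as np
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

arr = np.ones((512, 512, 3), dtype=np.uint8) * 128
arr[240:272, 240:272] = 0

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

with torch.no_grad():
    processed = processor(
        text="the black spot", images=arr, return_tensors="pt", padding=True
    )
    print(f"{processed.pixel_values.shape}")  # torch.Size([1, 3, 352, 352])
    mask = model(**processed)[0]
```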
Expected behavior
Expected behavior: it works, as it did before updating transformers
But instead, it fails with
Input image size (352*352) doesn't match model (224*224).
See my comment on the PR that broke it: https://github.com/huggingface/transformers/pull/32600/files#r1816732699
Secondarily, there's now a warning about one of the arguments, `padding` (that, again, is in the docs and has been there for years):
UserWarning: The following named arguments are not valid for 'ViTImageProcessor.preprocess' and were ignored: 'padding'