Add SigLIP #26522
Conversation
Force-pushed from 6713474 to 48db516
Force-pushed from 7b7608f to 0e6e9c0
Not a final review! Hope it helps.
@ArthurZucker I added 26590d2 for `split_special_tokens`. Also, this isn't supported by the fast tokenizer. To me it feels a bit weird to have this behaviour on by default just to match the original implementation, since the original implementation won't ever keep special tokens.
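For context, a rough sketch of what the flag changes, assuming a released SigLIP checkpoint (the checkpoint name and exact outputs are assumptions, not taken from this PR):

```python
from transformers import SiglipTokenizer

ckpt = "google/siglip-base-patch16-224"  # assumed checkpoint name

# Default behaviour: special tokens in the input text are kept intact.
tokenizer = SiglipTokenizer.from_pretrained(ckpt)
print(tokenizer.tokenize("hello </s>"))  # "</s>" should survive as a single special token

# With split_special_tokens=True, "</s>" is treated like ordinary text and goes
# through the same lower-casing/punctuation-stripping preprocessing as the rest.
tokenizer_split = SiglipTokenizer.from_pretrained(ckpt, split_special_tokens=True)
print(tokenizer_split.tokenize("hello </s>"))
```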
Thanks for adding this new model!
Feel free to merge; the last test would be a nice-to-have. The only thing to address is the padding max length that you force in the pipeline.
input_ids = self.tokenizer.encode("▁He is not ▁He")
self.assertEqual(input_ids, [37, 46, 44, 37, 2])
tokens = self.tokenizer.tokenize("▁He is not ▁He")
self.assertEqual(tokens, ["▁he", "▁is", "▁not", "▁he"])  # spaces are eaten by spm even if not start
Last thing is this!
* Add first draft * Use appropriate gelu function * More improvements * More improvements * More improvements * Convert checkpoint * More improvements * Improve docs, remove print statements * More improvements * Add link * remove unused masking function * begin tokenizer * do_lower_case * debug * set split_special_tokens=True * Remove script * Fix style * Fix rebase * Use same design as CLIP * Add fast tokenizer * Add SiglipTokenizer to init, remove extra_ids * Improve conversion script * Use smaller inputs in conversion script * Update conversion script * More improvements * Add processor to conversion script * Add tests * Remove print statements * Add tokenizer tests * Fix more tests * More improvements related to weight initialization * More improvements * Make more tests pass * More improvements * More improvements * Add copied from * Add canonicalize_text * Enable fast tokenizer tests * More improvements * Fix most slow tokenizer tests * Address comments * Fix style * Remove script * Address some comments * Add copied from to tests * Add more copied from * Add more copied from * Add more copied from * Remove is_flax_available * More updates * Address comment * Remove SiglipTokenizerFast for now * Add caching * Remove umt5 test * Add canonicalize_text inside _tokenize, thanks Arthur * Fix image processor tests * Skip tests which are not applicable * Skip test_initialization * More improvements * Compare pixel values * Fix doc tests, add integration test * Add do_normalize * Remove causal mask and leverage ignore copy * Fix attention_mask * Fix remaining tests * Fix dummies * Rename temperature and bias * Address comments * Add copied from to tokenizer tests * Add SiglipVisionModel to auto mapping * Add copied from to image processor tests * Improve doc * Remove SiglipVisionModel from index * Address comments * Improve docs * Simplify config * Add first draft * Make it like mistral * More improvements * Fix attention_mask * Fix output_attentions * Add note in docs * Convert multilingual model * Convert large checkpoint * Convert more checkpoints * Add pipeline support, correct image_mean and image_std * Use padding=max_length by default * Make processor like llava * Add code snippet * Convert more checkpoints * Set keep_punctuation_string=None as in OpenCLIP * Set normalized=False for special tokens * Fix doc test * Update integration test * Add figure * Update organization * Happy new year * Use AutoModel everywhere --------- Co-authored-by: patil-suraj <surajp815@gmail.com>
Thanks for adding this! Is there a reason why the processor doesn't return an `attention_mask` by default? The tokenizer supports it, but the processor doesn't accept the argument:

>>> processor.tokenizer(["hello bonjour", "bonjour"], padding=True, return_attention_mask=True)
{'input_ids': [[14647, 10048, 20852, 1], [10048, 20852, 1, 1]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 0]]}
>>> processor(text=["hello bonjour", "bonjour"], padding=True, return_attention_mask=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __call__() got an unexpected keyword argument 'return_attention_mask'

It looks like the processor's `__call__` doesn't forward `return_attention_mask` to the tokenizer.
Hi @VictorSanh, SigLIP was trained without an attention mask: text is padded to a fixed length and the padding tokens are attended to. We still provide the possibility to create an `attention_mask` with the tokenizer if you need one. Regarding the error, the processor's `__call__` indeed doesn't forward `return_attention_mask` to the tokenizer.
Got it, I didn't see the issue.
@VictorSanh another thing to note (which tripped me up) is that you need to use `padding="max_length"` when calling the processor/tokenizer, since that's how the model was trained.
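For anyone running into the same thing, a minimal sketch of the recommended call (the checkpoint name and the exact max length are assumptions):

```python
from transformers import AutoProcessor

ckpt = "google/siglip-base-patch16-224"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(ckpt)

# Pad every text to the fixed length the text tower was trained with,
# since SigLIP attends to padding tokens rather than masking them out.
inputs = processor(
    text=["hello bonjour", "bonjour"],
    padding="max_length",
    return_tensors="pt",
)
print(inputs["input_ids"].shape)  # expected something like torch.Size([2, 64])
```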
Interesting, thanks for the info. These are rather odd behaviours (compared to how other tokenizers and models behave). Do you think we can display that info somewhere, e.g. in the docs or the model card?
@VictorSanh Behaviour and doc examples were updated in #28578
Thank you!
Hi, could someone explain why you chose bicubic interpolation over bilinear for resizing the images? In the official big_vision repo, I find bilinear methods but not bicubic ones.
@NielsRogge good motivation to fill out #28180
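In the meantime, if someone wants to match big_vision's bilinear resizing, the standard `resample` override on the image processor should do it. A hedged sketch (checkpoint name assumed; whether this exactly reproduces the original preprocessing is not verified here):

```python
from transformers import SiglipImageProcessor
from transformers.image_utils import PILImageResampling

# Override the default bicubic resampling with bilinear, as used in big_vision.
image_processor = SiglipImageProcessor.from_pretrained(
    "google/siglip-base-patch16-224",  # assumed checkpoint name
    resample=PILImageResampling.BILINEAR,
)
```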
self.num_patches = (self.image_size // self.patch_size) ** 2
self.num_positions = self.num_patches
@NielsRogge Quick question if you don't mind: why is `self.num_positions = self.num_patches`? CLIP and other vision transformers have a `+1` in the number of positions. Is it because there's no CLASS embedding/token? I dug super deep and realized the original implementation (see google-research/vision_transformer#61) adds a CLASS embedding to keep it consistent with the generic Transformer architecture.
I thought, well, maybe SigLIP is different; maybe Google implemented it without the CLASS embedding this time, hence no `+1`. But in Google's repo (https://github.com/google-research/big_vision/blob/d0b297bbb8e073861d2d09ebd57ca46496357e36/big_vision/configs/proj/image_text/siglip_lit_coco.py#L81), this line sets the pool type to `tok`, which, if you go to their `vit` modeling file, still adds the CLASS embedding.
So I am guessing you simply didn't add the CLASS token/embedding in this implementation? But I am trying to figure out the reasoning behind it. Perhaps there's some code you followed that does that?
Validating my theory: I also see you don't have `self.class_embedding` declared in the `SiglipVisionEmbeddings` class, whereas models like CLIP or ViT have it initialized in their embeddings class.
I am trying to work on #30579 for SigLIP, so I'm trying to understand it better. If the class embedding isn't added, the interpolation function would differ slightly. I think it's probably safe to assume you are not adding the CLASS token based on this implementation, but I'm not 100% sure, so I'm confirming to be safe.
I don't think SigLIP has a CLS token indeed.
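For reference, a small sketch of the difference (sizes chosen for illustration):

```python
image_size, patch_size = 224, 16

num_patches = (image_size // patch_size) ** 2  # 196 patch embeddings

# SigLIP: no [CLS]/class embedding, so positions == patches
siglip_num_positions = num_patches             # 196

# CLIP/ViT: a learnable class embedding is prepended to the patch sequence
clip_num_positions = num_patches + 1           # 197
```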
@NielsRogge If you don't mind answering a question on this:
The docs say to use the SigLIP loss from the original implementation, since it isn't computed in the modeling code. Would a PR that adds the loss to `SiglipModel` be welcome?
@rootAvish feel free to open a PR. I'm not sure it would be equivalent in that case (e.g. if you then use the 🤗 Trainer API and run on multiple devices, are gradients synced in the same way?).
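For reference, a minimal single-device sketch of the sigmoid loss from the SigLIP paper; the chunked all-gather across devices used in the original implementation is omitted, and the argument names are assumptions rather than the exact `SiglipModel` attributes:

```python
import torch
import torch.nn.functional as F

def siglip_loss(
    image_embeds: torch.Tensor,  # (batch, dim), assumed L2-normalized
    text_embeds: torch.Tensor,   # (batch, dim), assumed L2-normalized
    logit_scale: torch.Tensor,   # learnable temperature (scalar)
    logit_bias: torch.Tensor,    # learnable bias (scalar)
) -> torch.Tensor:
    # Pairwise logits between every image and every text in the batch.
    logits = image_embeds @ text_embeds.t() * logit_scale + logit_bias
    batch_size = logits.size(0)
    # +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2 * torch.eye(batch_size, device=logits.device) - 1
    # Sigmoid loss: sum over all pairs, averaged over the batch.
    return -F.logsigmoid(labels * logits).sum() / batch_size
```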
What does this PR do?
This PR adds Google's new SigLIP model (CLIP trained with a pairwise sigmoid loss instead of the softmax contrastive loss). It's based on the Google Colab provided by the authors.
cc @patil-suraj feel free to take over this one
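A minimal usage sketch of what this enables, along the lines of the docs code snippet (checkpoint name and outputs are illustrative):

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-base-patch16-224"  # assumed checkpoint name
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of 2 cats", "a photo of 2 dogs"]

# padding="max_length" because that's how the model was trained
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid instead of softmax: each (image, text) pair gets an independent probability.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)
```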
To do:
- Use `T5Tokenizer`? (The vocab is defined here)
- Add the SigLIP loss? => the `torch.distributed` utilities would have to be incorporated
- Make sure `SiglipVisionModel` can be properly loaded without `from_pretrained` complaining
- Make sure `attention_mask` is not passed for SigLIP checkpoints, by updating `model_input_names` for the checkpoints
- `split_special_tokens=True`? => no, but users can pass this flag

A quick check for the `SiglipVisionModel` and `model_input_names` items is sketched below.
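Hedged check on a converted checkpoint (checkpoint name assumed; the expected output reflects the intent above rather than verified results):

```python
from transformers import SiglipTokenizer, SiglipVisionModel

ckpt = "google/siglip-base-patch16-224"  # assumed checkpoint name

# attention_mask should not be among the default model inputs for SigLIP checkpoints.
tokenizer = SiglipTokenizer.from_pretrained(ckpt)
print(tokenizer.model_input_names)  # expected: ['input_ids']

# The vision tower alone should load from a full checkpoint without
# from_pretrained complaining about missing or unexpected weights.
vision_model = SiglipVisionModel.from_pretrained(ckpt)
```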