Add SigLIP #26522

NielsRogge · 2023-10-01T17:49:58Z

What does this PR do?

This PR adds Google's new SigLIP model (CLIP with a better loss function). It's based on the Google Colab provided by the authors.

cc @patil-suraj feel free to take over this one

To do:

add SiglipTokenizer (or use T5Tokenizer ? The vocab is defined here)
add tests for the image processor, tokenizer and processor
add fast tokenizer and enable fast tokenizer tests => skip fast tokenizer for now, see branch add_siglip_fast_tokenizer
add loss function for training => won't do since various torch.distributed utilities would have to be incorporated
important one: make sure that weights of SiglipVisionModel can be properly loaded without from_pretrained complaining
make sure attention_mask is not passed for siglip checkpoints by updating model_input_names for checkpoints
set split_special_tokens=True? => no but users can pass this flag
transfer checkpoints, update organization name

HuggingFaceDocBuilderDev · 2023-10-01T18:06:34Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

github-actions · 2023-11-01T08:04:05Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker

Not a fiinal review! Hope it helps

src/transformers/models/siglip/tokenization_siglip.py

src/transformers/models/siglip/tokenization_siglip_fast.py

src/transformers/models/siglip/test.py

src/transformers/convert_slow_tokenizer.py

tests/models/siglip/test_tokenization_siglip.py

NielsRogge · 2024-01-02T15:19:34Z

@ArthurZucker I added 26590d2 for split_special_tokens=True which required me to overwrite some tests of tokenization_common.py. Could you have a look?

Also, this isn't supported by tokenizers yet right?

To me it feels a bit weird to have this behaviour by default to match the original implementation, since any original implementation won't ever keep special tokens.

ArthurZucker

Thanks for adding this new model!
Feel free to merge, the last test would be a nice to have, only thing to adress is the padding max length that you force in the pipeline

src/transformers/models/siglip/__init__.py

src/transformers/models/siglip/tokenization_siglip.py

src/transformers/pipelines/zero_shot_image_classification.py

ArthurZucker · 2024-01-08T16:07:50Z

tests/models/siglip/test_tokenization_siglip.py

+        input_ids = self.tokenizer.encode("▁He is not             ▁He")
+        self.assertEqual(input_ids, [37, 46, 44, 37, 2])
+        tokens = self.tokenizer.tokenize("▁He is not              ▁He")
+        self.assertEqual(tokens, ["▁he", "▁is", "▁not", "▁he"])  # spaces are eaten by spm even if not start


Last thing is this !

* Add first draft * Use appropriate gelu function * More improvements * More improvements * More improvements * Convert checkpoint * More improvements * Improve docs, remove print statements * More improvements * Add link * remove unused masking function * begin tokenizer * do_lower_case * debug * set split_special_tokens=True * Remove script * Fix style * Fix rebase * Use same design as CLIP * Add fast tokenizer * Add SiglipTokenizer to init, remove extra_ids * Improve conversion script * Use smaller inputs in conversion script * Update conversion script * More improvements * Add processor to conversion script * Add tests * Remove print statements * Add tokenizer tests * Fix more tests * More improvements related to weight initialization * More improvements * Make more tests pass * More improvements * More improvements * Add copied from * Add canonicalize_text * Enable fast tokenizer tests * More improvements * Fix most slow tokenizer tests * Address comments * Fix style * Remove script * Address some comments * Add copied from to tests * Add more copied from * Add more copied from * Add more copied from * Remove is_flax_available * More updates * Address comment * Remove SiglipTokenizerFast for now * Add caching * Remove umt5 test * Add canonicalize_text inside _tokenize, thanks Arthur * Fix image processor tests * Skip tests which are not applicable * Skip test_initialization * More improvements * Compare pixel values * Fix doc tests, add integration test * Add do_normalize * Remove causal mask and leverage ignore copy * Fix attention_mask * Fix remaining tests * Fix dummies * Rename temperature and bias * Address comments * Add copied from to tokenizer tests * Add SiglipVisionModel to auto mapping * Add copied from to image processor tests * Improve doc * Remove SiglipVisionModel from index * Address comments * Improve docs * Simplify config * Add first draft * Make it like mistral * More improvements * Fix attention_mask * Fix output_attentions * Add note in docs * Convert multilingual model * Convert large checkpoint * Convert more checkpoints * Add pipeline support, correct image_mean and image_std * Use padding=max_length by default * Make processor like llava * Add code snippet * Convert more checkpoints * Set keep_punctuation_string=None as in OpenCLIP * Set normalized=False for special tokens * Fix doc test * Update integration test * Add figure * Update organization * Happy new year * Use AutoModel everywhere --------- Co-authored-by: patil-suraj <surajp815@gmail.com>

VictorSanh · 2024-01-18T21:52:08Z

thanks for adding this!

is there a reason why processor(text=["hello bonjour", "bonjour"], return_tensors="pt", padding=True) does not return any attention mask?

Perhaps it refers to make sure attention_mask is not passed for siglip checkpoints by updating model_input_names for checkpoints but i am not sure i understand why

>>> processor.tokenizer(["hello bonjour", "bonjour"], padding=True, return_attention_mask=True)
{'input_ids': [[14647, 10048, 20852, 1], [10048, 20852, 1, 1]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 0]]}
>>> processor(text=["hello bonjour", "bonjour"], padding=True, return_attention_mask=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __call__() got an unexpected keyword argument 'return_attention_mask'

it looks like return_attention_mask is not passed to the tokenizer in the call to the processor

NielsRogge · 2024-01-18T23:28:39Z

Hi @VictorSanh, SigLIP was trained without attention_mask (said differently, the text encoder attends to all tokens, including padding tokens!). Hence I explicitly had to set model_input_names only to "input_ids" for the checkpoints on the hub such that the model will internally attend to all tokens.

We still provide the possibility to create an attention_mask if you want padding tokens to be ignored, although predictions with the existing checkpoints will be pretty bad as that's not how those were trained.

Regarding the return_attention_mask argument not being passed to the tokenizer, indeed that's not supported yet. I'll add it as part of #28578

VictorSanh · 2024-01-19T02:34:42Z

got it, i didn't see the issue.
that's quite odd that attention mask was not used

xenova · 2024-01-19T10:20:19Z

@VictorSanh another thing to note (which tripped me up), is that you need to use padding='max_length'... otherwise, the output differs wildly (see here for more info).

VictorSanh · 2024-01-19T14:55:29Z

interesting, thanks for the info

these are rather odd behaviors (in comparison to what other tokenizers & models behave). do you think we can display that info somewhere? in the doc or the model card for instance.

* Add first draft * Use appropriate gelu function * More improvements * More improvements * More improvements * Convert checkpoint * More improvements * Improve docs, remove print statements * More improvements * Add link * remove unused masking function * begin tokenizer * do_lower_case * debug * set split_special_tokens=True * Remove script * Fix style * Fix rebase * Use same design as CLIP * Add fast tokenizer * Add SiglipTokenizer to init, remove extra_ids * Improve conversion script * Use smaller inputs in conversion script * Update conversion script * More improvements * Add processor to conversion script * Add tests * Remove print statements * Add tokenizer tests * Fix more tests * More improvements related to weight initialization * More improvements * Make more tests pass * More improvements * More improvements * Add copied from * Add canonicalize_text * Enable fast tokenizer tests * More improvements * Fix most slow tokenizer tests * Address comments * Fix style * Remove script * Address some comments * Add copied from to tests * Add more copied from * Add more copied from * Add more copied from * Remove is_flax_available * More updates * Address comment * Remove SiglipTokenizerFast for now * Add caching * Remove umt5 test * Add canonicalize_text inside _tokenize, thanks Arthur * Fix image processor tests * Skip tests which are not applicable * Skip test_initialization * More improvements * Compare pixel values * Fix doc tests, add integration test * Add do_normalize * Remove causal mask and leverage ignore copy * Fix attention_mask * Fix remaining tests * Fix dummies * Rename temperature and bias * Address comments * Add copied from to tokenizer tests * Add SiglipVisionModel to auto mapping * Add copied from to image processor tests * Improve doc * Remove SiglipVisionModel from index * Address comments * Improve docs * Simplify config * Add first draft * Make it like mistral * More improvements * Fix attention_mask * Fix output_attentions * Add note in docs * Convert multilingual model * Convert large checkpoint * Convert more checkpoints * Add pipeline support, correct image_mean and image_std * Use padding=max_length by default * Make processor like llava * Add code snippet * Convert more checkpoints * Set keep_punctuation_string=None as in OpenCLIP * Set normalized=False for special tokens * Fix doc test * Update integration test * Add figure * Update organization * Happy new year * Use AutoModel everywhere --------- Co-authored-by: patil-suraj <surajp815@gmail.com>

amyeroberts · 2024-01-22T14:53:18Z

@VictorSanh Behaviour and doc examples were updated in #28578

VictorSanh · 2024-01-22T16:05:16Z

thank you!

HugoLaurencon · 2024-01-22T17:49:37Z

Hi, could someone explain why you chose to use Bicubic interpolation over Bilinear ones for the resizing of the images? In the official BigVision repo, I find bilinear methods but not bicubic ones.
https://github.com/google-research/big_vision/blob/main/big_vision/pp/ops_image.py

* Add first draft * Use appropriate gelu function * More improvements * More improvements * More improvements * Convert checkpoint * More improvements * Improve docs, remove print statements * More improvements * Add link * remove unused masking function * begin tokenizer * do_lower_case * debug * set split_special_tokens=True * Remove script * Fix style * Fix rebase * Use same design as CLIP * Add fast tokenizer * Add SiglipTokenizer to init, remove extra_ids * Improve conversion script * Use smaller inputs in conversion script * Update conversion script * More improvements * Add processor to conversion script * Add tests * Remove print statements * Add tokenizer tests * Fix more tests * More improvements related to weight initialization * More improvements * Make more tests pass * More improvements * More improvements * Add copied from * Add canonicalize_text * Enable fast tokenizer tests * More improvements * Fix most slow tokenizer tests * Address comments * Fix style * Remove script * Address some comments * Add copied from to tests * Add more copied from * Add more copied from * Add more copied from * Remove is_flax_available * More updates * Address comment * Remove SiglipTokenizerFast for now * Add caching * Remove umt5 test * Add canonicalize_text inside _tokenize, thanks Arthur * Fix image processor tests * Skip tests which are not applicable * Skip test_initialization * More improvements * Compare pixel values * Fix doc tests, add integration test * Add do_normalize * Remove causal mask and leverage ignore copy * Fix attention_mask * Fix remaining tests * Fix dummies * Rename temperature and bias * Address comments * Add copied from to tokenizer tests * Add SiglipVisionModel to auto mapping * Add copied from to image processor tests * Improve doc * Remove SiglipVisionModel from index * Address comments * Improve docs * Simplify config * Add first draft * Make it like mistral * More improvements * Fix attention_mask * Fix output_attentions * Add note in docs * Convert multilingual model * Convert large checkpoint * Convert more checkpoints * Add pipeline support, correct image_mean and image_std * Use padding=max_length by default * Make processor like llava * Add code snippet * Convert more checkpoints * Set keep_punctuation_string=None as in OpenCLIP * Set normalized=False for special tokens * Fix doc test * Update integration test * Add figure * Update organization * Happy new year * Use AutoModel everywhere --------- Co-authored-by: patil-suraj <surajp815@gmail.com>

amyeroberts · 2024-01-23T15:27:59Z

Hi, could someone explain why you chose to use Bicubic interpolation over Bilinear ones for the resizing of the images? In the official BigVision repo, I find bilinear methods but not bicubic ones. https://github.com/google-research/big_vision/blob/main/big_vision/pp/ops_image.py

@NielsRogge good motivation to fill out #28180

davidgxue · 2024-05-07T19:50:46Z

src/transformers/models/siglip/modeling_siglip.py

+        )
+
+        self.num_patches = (self.image_size // self.patch_size) ** 2
+        self.num_positions = self.num_patches


@NielsRogge Quick question if you don't mind: why is self.num_positions = self.num_patches? CLIP and other vision transformers has +1 for num of positions. Is it because there's not CLASS embedding/token? I dug super deep and realized the original implementation from here google-research/vision_transformer#61 they add a CLASS embedding to keep it consistent as the generic Transformers architecture.

I thought well, maybe SigLIP is different. Maybe Google implemented it without the CLASS embedding this time. Hence why it doesn't have the +1. But from Google's repo (https://github.com/google-research/big_vision/blob/d0b297bbb8e073861d2d09ebd57ca46496357e36/big_vision/configs/proj/image_text/siglip_lit_coco.py#L81), this line has the pool type set to tok which if you go to their vit modeling file, adds the CLASS embedding still.

So... I am guessing it just so happens you didn't add the CLASS token/embedding in this implementation? But I am trying to figure out the reasoning behind it. Perhaps there's some code you followed that does that?
Validating my thoery: I also see you don't have self.class_embedding declared in the SiglipVisionEmbeddings class, whereas models like CLIP or VIT has it initialized for their embeddings class.

I am trying to work on #30579 for SigLIP, so trying to understand it better. If class embedding isn't added then the interpolate function would differ slightly. I think it maybe safe assume you are not adding the CLASS token just based on this implementation, but not 100% sure. Confirming to be safe

I don't think SigLIP has a CLS token indeed.

rootAvish · 2024-05-30T17:26:24Z

@NielsRogge If you don't mind answering a question on this:

add loss function for training => won't do since various torch.distributed utilities would have to be incorporated

The docs say to use the SigLIP loss from the open_clip repository. Going over the code from open_clip I see that that it is indeed implemented using a lot of functionality from torch.distributed, but I didn't quite understand why this one can't be implemented without torch.distributed, the function in open_clip also allows for a world size of 1 right (which is equivalent to not using torch.distributed)?

NielsRogge · 2024-06-01T12:43:09Z

@rootAvish feel free to open a PR, I'm not sure it would be equivalent in that case (e.g. if you would then use the 🤗 Trainer API and run on multiple devices, are gradients synced in the same way?).

github-actions bot closed this Nov 9, 2023

NielsRogge reopened this Nov 12, 2023

NielsRogge force-pushed the add_siglip branch from 6713474 to 48db516 Compare November 20, 2023 09:49

NielsRogge mentioned this pull request Nov 20, 2023

Getting equivalent results between Transformer's resize and tf.image.resize #27601

Open

NielsRogge force-pushed the add_siglip branch from 7b7608f to 0e6e9c0 Compare November 21, 2023 18:33

ArthurZucker reviewed Nov 29, 2023

View reviewed changes

NielsRogge and others added 22 commits November 30, 2023 08:57

Add first draft

091ffff

Use appropriate gelu function

7e66228

More improvements

d341010

More improvements

432ec56

More improvements

3d700ae

Convert checkpoint

61e98d1

More improvements

4cfb533

Improve docs, remove print statements

7bd5816

More improvements

85e8efa

Add link

5017a6a

remove unused masking function

f07af7c

begin tokenizer

6496041

do_lower_case

fcd729f

debug

3cc8b4f

set split_special_tokens=True

9e02a78

Remove script

13f3f81

Fix style

c381838

Fix rebase

82cd61b

Use same design as CLIP

35fb3dc

Add fast tokenizer

eb8748b

Add SiglipTokenizer to init, remove extra_ids

59a616c

Improve conversion script

da43b79

NielsRogge added 2 commits December 28, 2023 11:04

Set keep_punctuation_string=None as in OpenCLIP

985a922

Set normalized=False for special tokens

ffdcfc5

NielsRogge force-pushed the add_siglip branch from 4320b1c to ffdcfc5 Compare January 2, 2024 13:32

Fix doc test

f678d4c

ArthurZucker self-requested a review January 3, 2024 16:09

Update integration test

e8b8230

NielsRogge force-pushed the add_siglip branch from 2388ec3 to e8b8230 Compare January 8, 2024 08:24

NielsRogge added 2 commits January 8, 2024 15:13

Add figure

ed40ec5

Update organization

8c83814

ArthurZucker approved these changes Jan 8, 2024

View reviewed changes

NielsRogge added 2 commits January 8, 2024 17:16

Happy new year

4a0e70c

Use AutoModel everywhere

5e3df73

ArthurZucker merged commit 3b742ea into huggingface:main Jan 8, 2024
23 checks passed

davidgxue reviewed May 7, 2024

View reviewed changes

davidgxue mentioned this pull request May 8, 2024

Add dynamic resolution input/interpolate position embedding to SigLIP #30719

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SigLIP #26522

Add SigLIP #26522

NielsRogge commented Oct 1, 2023 •

edited

Loading

HuggingFaceDocBuilderDev commented Oct 1, 2023

github-actions bot commented Nov 1, 2023

ArthurZucker left a comment

NielsRogge commented Jan 2, 2024 •

edited

Loading

ArthurZucker left a comment

ArthurZucker Jan 8, 2024

VictorSanh commented Jan 18, 2024 •

edited

Loading

NielsRogge commented Jan 18, 2024

VictorSanh commented Jan 19, 2024

xenova commented Jan 19, 2024 •

edited

Loading

VictorSanh commented Jan 19, 2024

amyeroberts commented Jan 22, 2024 •

edited

Loading

VictorSanh commented Jan 22, 2024

HugoLaurencon commented Jan 22, 2024

amyeroberts commented Jan 23, 2024

davidgxue May 7, 2024 •

edited

Loading

NielsRogge May 8, 2024

rootAvish commented May 30, 2024 •

edited

Loading

NielsRogge commented Jun 1, 2024

Add SigLIP #26522

Add SigLIP #26522

Conversation

NielsRogge commented Oct 1, 2023 • edited Loading

What does this PR do?

HuggingFaceDocBuilderDev commented Oct 1, 2023

github-actions bot commented Nov 1, 2023

ArthurZucker left a comment

Choose a reason for hiding this comment

NielsRogge commented Jan 2, 2024 • edited Loading

ArthurZucker left a comment

Choose a reason for hiding this comment

ArthurZucker Jan 8, 2024

Choose a reason for hiding this comment

VictorSanh commented Jan 18, 2024 • edited Loading

NielsRogge commented Jan 18, 2024

VictorSanh commented Jan 19, 2024

xenova commented Jan 19, 2024 • edited Loading

VictorSanh commented Jan 19, 2024

amyeroberts commented Jan 22, 2024 • edited Loading

VictorSanh commented Jan 22, 2024

HugoLaurencon commented Jan 22, 2024

amyeroberts commented Jan 23, 2024

davidgxue May 7, 2024 • edited Loading

Choose a reason for hiding this comment

NielsRogge May 8, 2024

Choose a reason for hiding this comment

rootAvish commented May 30, 2024 • edited Loading

NielsRogge commented Jun 1, 2024

NielsRogge commented Oct 1, 2023 •

edited

Loading

NielsRogge commented Jan 2, 2024 •

edited

Loading

VictorSanh commented Jan 18, 2024 •

edited

Loading

xenova commented Jan 19, 2024 •

edited

Loading

amyeroberts commented Jan 22, 2024 •

edited

Loading

davidgxue May 7, 2024 •

edited

Loading

rootAvish commented May 30, 2024 •

edited

Loading