
CoCa model implementation #517

Open
seungkyuK opened this issue Dec 27, 2023 · 1 comment

🚀 The feature, motivation and pitch

Thank you for your awesome work!
I have some questions about the CoCa model implementation.

In */multimodal/torchmultimodal/models/coca/coca_model.py, it seems we can choose between a CascadedAttentionPooler and a single AttentionPooler.
However, when using CascadedAttentionPooler, the dimensions do not match at the second pooler.

For example, the vision feature extracted from the VisionEncoder has shape (B, h*w, dim).
It then passes through the vision pooler (pooled_outputs = self.vision_pooler(image_embeddings)), and when using CascadedAttentionPooler, self.vision_pooler holds two sequential AttentionPooler layers.
After the first AttentionPooler layer, the feature has shape (B, 256, q_dim), which does not match the LayerNorm of the second pooler, which expects dim rather than q_dim.
Is it okay if I arbitrarily modify the input dimension of the second AttentionPooler layer?
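
For reference, here is a rough, self-contained sketch of the mismatch I mean, using a toy pooler rather than the actual TorchMultimodal classes (names and signatures below are just for illustration):

```
# Toy stand-in for an attention pooler: learned queries cross-attend to the
# input tokens, with a LayerNorm over the *input* dimension.
import torch
from torch import nn


class ToyAttentionPooler(nn.Module):
    def __init__(self, input_dim, output_dim, n_queries, n_head=8):
        super().__init__()
        self.ln = nn.LayerNorm(input_dim)  # normalizes over input_dim
        self.query = nn.Parameter(torch.randn(n_queries, output_dim))
        self.attn = nn.MultiheadAttention(
            embed_dim=output_dim, num_heads=n_head,
            kdim=input_dim, vdim=input_dim, batch_first=True,
        )

    def forward(self, x):
        x = self.ln(x)  # fails if x's last dim != input_dim
        q = self.query.expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)
        return out  # (B, n_queries, output_dim)


B, hw, dim, q_dim = 2, 1024, 768, 512
image_embeddings = torch.randn(B, hw, dim)

# Cascaded setup where both poolers are built with input_dim=dim:
pooler1 = ToyAttentionPooler(input_dim=dim, output_dim=q_dim, n_queries=256)
pooler2 = ToyAttentionPooler(input_dim=dim, output_dim=q_dim, n_queries=256)

pooled = pooler1(image_embeddings)  # (B, 256, q_dim)
# pooler2(pooled)  # RuntimeError: LayerNorm expects last dim = dim, got q_dim

# Building the second pooler with input_dim=q_dim instead makes the shapes line up:
pooler2_fixed = ToyAttentionPooler(input_dim=q_dim, output_dim=q_dim, n_queries=256)
print(pooler2_fixed(pooled).shape)  # torch.Size([2, 256, 512])
```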

Similarly, when using 'vision_cls_token' with CascadedAttentionPooler, the vision feature has shape (B, h*w + 1(cls), dim) (e.g., B, 1025, 768).
The vision pooler then returns learnable query tokens after cross-attention with the vision feature, with shape (B, 256, q_dim) for captioning_image_embeddings and contrastive_image_embeddings, respectively.
If you did not intend to use the visual features directly, is it necessary to add the 'cls_token' at the initial stage?
In other words, what is the purpose of prepending the 'cls_token' to the visual features if it is not used directly?

Thank you again!

Alternatives

No response

Additional context

No response

ebsmothers commented Jan 3, 2024

Hi @seungkyuK thanks for creating the issue! Sorry for the delayed reply, I missed this one over the holidays.

You are right about (1): we need to change input_embed_dim for the contrastive pooler to match the output dim from the captioning pooler. I will open up a quick PR to fix this.

On (2): this is an interesting case. Actually we went back and forth on whether to include the CLS token in CoCa's vision encoder at all because it is not really clear from the paper that they use it. The open_clip implementation (which we compared against in #507) does use it, but they also only used global average pooling in their original implementation. However, our read of the pseudocode in Figure 2 of the paper was that they do not use CLS. As a result you'll see that most of our models default to vision_include_cls_embed=False. So we actually didn't intend for people to set both vision_include_cls_embed and cascaded_pooler to True (though we did not do a good enough job making this clear).

If you are setting vision_include_cls_embed to True and you want to use the vision encoder's CLS token directly, you can define your own pooler (we already have the CLS pooler defined here; one option is to just concat that with the usual attention pooler for the captioning objective). The alternative is to just use the contrastive pooler to aggregate over all tokens by setting n_queries=1 here (I think we should make this change anyway; it doesn't make sense to return more than one token for the contrastive objective).
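
To make that concrete, here is a rough sketch of such a custom pooler, using toy classes rather than the actual TorchMultimodal poolers (the name and return convention below are just for illustration): take the CLS token directly for the contrastive branch and attention-pool the patch tokens for the captioning branch.

```
import torch
from torch import nn


class ClsPlusAttentionPooler(nn.Module):
    """Hypothetical pooler: CLS token for contrastive, attention pooling for captioning."""

    def __init__(self, dim, n_queries=256, n_head=8):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.query = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_head, batch_first=True)

    def forward(self, x):
        # x: (B, 1 + h*w, dim), with the CLS token assumed to be prepended at index 0
        contrastive_embed = x[:, :1]  # (B, 1, dim): the CLS token, used directly
        tokens = self.ln(x[:, 1:])  # (B, h*w, dim): patch tokens
        q = self.query.expand(x.size(0), -1, -1)
        captioning_embed, _ = self.attn(q, tokens, tokens)  # (B, n_queries, dim)
        return [captioning_embed, contrastive_embed]


pooler = ClsPlusAttentionPooler(dim=768)
feats = torch.randn(2, 1 + 1024, 768)
cap, con = pooler(feats)
print(cap.shape, con.shape)  # torch.Size([2, 256, 768]) torch.Size([2, 1, 768])
```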

Then you are correct that the CLS embedding is no longer used directly (and actually I think this is true regardless of whether we use cascaded or parallel attention poolers). One thing we could do is modify coca_vit to make it easier to define a custom pooler that uses CLS directly (similarly to the one I mentioned above), otherwise it has to be done from the CoCaModel class which is a bit more work.

For now I will at least make the change to fix (1) and set n_queries=1 in the contrastive pooler by default. Please let me know if my discussion of (2) makes sense and whether there's anything we can do to make things clearer on the attention pooler + CLS front. Update: opened #518 for this

facebook-github-bot pushed a commit that referenced this issue Jan 4, 2024
Summary:
A couple fixes to CoCa's attention pooling as pointed out in #517. Specifically, we need to change the input dim for the contrastive pooler to match the output dim from the captioning pooler in the case of cascaded attention pooling. We should also set `n_queries=1` for the contrastive pooler so that the pooled embeddings can be directly fed into contrastive loss (after appropriate normalization).

Pull Request resolved: #518

Test Plan:
```
import torch
from torchmultimodal.models.coca.coca_model import coca_vit_l_14
model = coca_vit_l_14()
bs, c, h, w, seq_len, vocab_size = 2, 3, 224, 224, 77, 49408
images = torch.randn(bs, c, h, w)
texts = torch.randint(0, vocab_size, (bs, seq_len))
out = model(images, texts)
print(out.image_pooled_output.shape, out.multimodal_embeddings.shape)
...
torch.Size([2, 1, 768]) torch.Size([2, 76, 49408])
```
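
For reference, a minimal sketch (illustrative only, not code from this PR) of how a (B, 1, embed_dim) pooled output can be normalized and fed into a CLIP-style contrastive loss:

```
import torch
import torch.nn.functional as F

B, embed_dim, temperature = 2, 768, 0.07
image_pooled = torch.randn(B, 1, embed_dim)  # e.g. image_pooled_output with n_queries=1
text_pooled = torch.randn(B, embed_dim)      # pooled text embedding (shape assumed here)

image_feat = F.normalize(image_pooled.squeeze(1), dim=-1)  # (B, embed_dim), unit norm
text_feat = F.normalize(text_pooled, dim=-1)

logits = image_feat @ text_feat.t() / temperature  # (B, B) similarity matrix
targets = torch.arange(B)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```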

Add new unit test:

```
python -m pytest -v tests/models/coca/test_coca_model.py
...
===== 4 passed in 3.18s ======
```

Reviewed By: pbontrager

Differential Revision: D52523771

Pulled By: ebsmothers

fbshipit-source-id: 7c0197605e478ae6e3204f1ec0ab2e6adbf2377e