Text embeddings in distillation loss #1

Open
gonzachiar opened this issue Jun 16, 2023 · 1 comment

Comments

@gonzachiar

In the distillation loss of continual-CLIP:

https://github.com/Thunderbeee/ZSCL/blob/main/cil/continual_clip/models.py#LL260C4-L260C4

Shouldn't you also do the opposite comparison: compare the current model's embeddings of the ref_text with the original model's embeddings of the ref_images?

Also, if the method is "LwF", shouldn't the logits_current be computed between the current model's embeddings of the ref_images and the ref_texts, rather than between the current model's embeddings of the ref_images and the ref model's embeddings of the ref_texts?
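
To make this concrete, here is a rough sketch of what I mean (PyTorch-style; `cur_img`, `ref_txt`, the dummy tensors, and the `distill` helper are placeholders of mine, not the actual variables in `models.py`, and the KL form is just a generic soft-label distillation for illustration):

```python
import torch
import torch.nn.functional as F

def distill(logits_student, logits_teacher, T=2.0):
    # Generic LwF-style soft-label distillation between two logit matrices.
    return F.kl_div(
        F.log_softmax(logits_student / T, dim=-1),
        F.softmax(logits_teacher / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# Dummy, L2-normalized embeddings standing in for the real encoders:
N, M, D = 8, 8, 512
cur_img = F.normalize(torch.randn(N, D), dim=-1)  # current_model.encode_image(ref_images)
cur_txt = F.normalize(torch.randn(M, D), dim=-1)  # current_model.encode_text(ref_texts)
ref_img = F.normalize(torch.randn(N, D), dim=-1)  # ref_model.encode_image(ref_images), frozen
ref_txt = F.normalize(torch.randn(M, D), dim=-1)  # ref_model.encode_text(ref_texts), frozen

# Image-to-text direction (what I understand the current code does):
# current image embeddings compared against the *reference* text embeddings.
loss_i2t = distill(cur_img @ ref_txt.t(), ref_img @ ref_txt.t())

# The "opposite comparison" I am suggesting: text-to-image direction,
# current text embeddings compared against the *reference* image embeddings.
loss_t2i = distill(cur_txt @ ref_img.t(), ref_txt @ ref_img.t())

loss = 0.5 * (loss_i2t + loss_t2i)

# And for the LwF question: should logits_current instead be
#   cur_img @ cur_txt.t()   (both embeddings from the current model)
# rather than cur_img @ ref_txt.t()?
```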

[Two screenshots attached: 2023-06-16 11-31-08 and 2023-06-16 11-30-53]

If that isn't the case, there is no possibility of fine-tuning the text encoder only. Why is this discarded for continual CLIP?

Sorry if these questions are pretty basic.

@Thunderbeee
Owner

Thanks so much for your comments! Regarding your questions:

“Shouldn't you also do the opposite comparison?”
--- Because LwF had not been applied to contrastive-learning approaches before (this is the first time LwF is adopted to handle the forgetting issue on CLIP), what we do in our experiments is our own design choice. Since we are comparing continual-learning (CL) methods, the experiments are controlled as long as all CL methods use the exact same assignment of (ref_model, ref_image, ref_text, zero shot, target_image, target_text). The final experiment tables on arXiv are sufficient to demonstrate that our method (ZSCL) outperforms the SOTA methods. Of course, your suggestion is helpful and could be explored in future ablation experiments on continual learning of vision-language models!

“Also, if the method is "LwF", shouldn't the logits_current be between the current model embeddings of the ref_images and the ref_texts, instead of between the current model embeddings of the ref_images and the ref model embeddings of the ref_texts?”
--- I think this question is similar to the previous one; again, it is our design choice. With two encoders, we let the ref_texts embeddings come from the reference model (the original CLIP) to better ensure that the image encoder being trained stays aligned with the text encoder of the original CLIP; in other words, when distilling with two encoders we control variables by keeping one encoder fixed to the original CLIP.
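
As a rough illustration of this design choice (placeholder tensors and names, not the exact code in `models.py`, and a generic KL distillation just for the sketch):

```python
import torch
import torch.nn.functional as F

# Dummy, L2-normalized embeddings standing in for the encoders:
#   cur_img: image encoder currently being trained
#   ref_img, ref_txt: frozen reference model (original CLIP)
D = 512
cur_img = F.normalize(torch.randn(8, D), dim=-1)
ref_img = F.normalize(torch.randn(8, D), dim=-1)
ref_txt = F.normalize(torch.randn(8, D), dim=-1)

# The text side always comes from the frozen reference model, so the trainable
# image encoder stays anchored to the text space of the original CLIP.
logits_current = cur_img @ ref_txt.t()   # student: current images vs. reference texts
logits_ref     = ref_img @ ref_txt.t()   # teacher: fully frozen

T = 2.0
loss = F.kl_div(
    F.log_softmax(logits_current / T, dim=-1),
    F.softmax(logits_ref / T, dim=-1),
    reduction="batchmean",
) * (T * T)

# The alternative in the question would let both sides move with training:
#   logits_current = cur_img @ cur_txt.t()
```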

“If that isn't the case, there is no possibility of fine-tuning the text encoder only. Why is this discarded for continual CLIP?”
--- It is feasible to train only the text encoder, but we keep the setup uniform here to control variables; since our goal is to compare the differences between methods, this is our experimental design choice.

Again, thanks so much for your constructive comments! Many ablation studies could be done in the future to explore the continual learning of vision-language models!
