- CLIP in point-cloud/3D.
- Open-Vocabulary Object Detection (OVD)
- efficient CLIP training (better use of computation or data)
- applying CLIP models to narrow fields, such as human-object interaction detection, crowd counting, etc.
Papers from CVPR2023:
(some papers may be missing)
Title | Description | Code |
---|---|---|
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training | Reduces memory consumption in distributed training by decomposing the gradient of the contrastive loss | code |
Scaling Language-Image Pre-training via Masking | Randomly masks image patches in the image branch of CLIP during pre-training, improving training speed, memory usage, and performance (a masking sketch follows this table) | code |
Non-Contrastive Learning Meets Language-Image Pre-Training | Adds the non-contrastive loss from SwAV (based on cluster-assignment agreement) on top of CLIP's contrastive loss. Interestingly, the non-contrastive loss alone gives poor zero-shot performance, but the weighted combination (0.7 SwAV + 0.3 contrastive) outperforms the contrastive loss alone. It also reduces the amount of data needed (trained on only 35 million pairs) and the batch size (4096 compared to 32K); a loss-weighting sketch follows this table | code |
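A minimal sketch of the patch-masking idea from "Scaling Language-Image Pre-training via Masking", assuming a ViT-style image encoder that consumes a sequence of patch tokens; the function name and the `mask_ratio` default are illustrative, not the paper's exact recipe.

```python
import torch

def random_mask_patches(patch_tokens: torch.Tensor, mask_ratio: float = 0.5):
    """Randomly drop a fraction of patch tokens before the image encoder.

    patch_tokens: (batch, num_patches, dim) patch embeddings.
    Returns the kept tokens (batch, num_kept, dim) and their indices.
    """
    batch, num_patches, dim = patch_tokens.shape
    num_kept = int(num_patches * (1.0 - mask_ratio))

    # Sample a random permutation per image and keep the first num_kept patches.
    noise = torch.rand(batch, num_patches, device=patch_tokens.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_kept]

    kept = torch.gather(
        patch_tokens, dim=1,
        index=ids_keep.unsqueeze(-1).expand(-1, -1, dim),
    )
    return kept, ids_keep
```

The text branch is untouched; the speed and memory savings come purely from the image encoder seeing fewer tokens, which in turn allows larger batches.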
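A minimal sketch of the loss weighting described for "Non-Contrastive Learning Meets Language-Image Pre-Training" (0.7 non-contrastive + 0.3 contrastive); the SwAV-style cluster-assignment term is left as a placeholder callable, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Standard CLIP InfoNCE loss over a batch of paired embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def combined_loss(img_emb, txt_emb, swav_loss_fn, w_noncon=0.7, w_con=0.3):
    """Weighted sum of a SwAV-style non-contrastive loss and the CLIP loss.

    swav_loss_fn is a placeholder for the cluster-assignment objective;
    the 0.7 / 0.3 weights follow the description in the table above.
    """
    return (w_noncon * swav_loss_fn(img_emb, txt_emb) +
            w_con * clip_contrastive_loss(img_emb, txt_emb))
```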
Title | Description | Code |
---|---|---|
Learning to Name Classes for Vision and Language Models | Creates learnable token embeddings for the class names in an otherwise frozen CLIP model, reducing the need for prompt engineering (a class-name sketch follows this table) | NA |
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models | Shows that when fine-tuning the model with a linear classifier, it helps to train the classifier on examples from both modalities rather than images alone | NA |
MaPLe: Multi-modal Prompt Learning | Learns prompts on both the image and text branches; the image prompts are derived from a linear layer that takes the text prompts as input (a coupled-prompt sketch follows this table) | code |
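A minimal sketch of the learnable class-name idea from "Learning to Name Classes for Vision and Language Models": each class gets trainable word embeddings that replace its tokenized name inside a fixed prompt template, while the rest of CLIP stays frozen. The module, shapes, and prompt handling here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LearnableClassNames(nn.Module):
    """Trainable per-class word embeddings spliced into a frozen prompt.

    prompt_prefix_embs / prompt_suffix_embs are the frozen token embeddings
    of e.g. "a photo of a" and ".", precomputed with CLIP's token embedder.
    """

    def __init__(self, n_classes: int, n_name_tokens: int, dim: int,
                 prompt_prefix_embs: torch.Tensor,
                 prompt_suffix_embs: torch.Tensor):
        super().__init__()
        # One learnable embedding sequence per class, replacing its name tokens.
        self.class_name_embs = nn.Parameter(
            torch.randn(n_classes, n_name_tokens, dim) * 0.02)
        self.register_buffer("prefix", prompt_prefix_embs)  # (n_prefix, dim)
        self.register_buffer("suffix", prompt_suffix_embs)  # (n_suffix, dim)

    def forward(self) -> torch.Tensor:
        n_classes = self.class_name_embs.size(0)
        prefix = self.prefix.unsqueeze(0).expand(n_classes, -1, -1)
        suffix = self.suffix.unsqueeze(0).expand(n_classes, -1, -1)
        # (n_classes, n_prefix + n_name_tokens + n_suffix, dim), fed into the
        # frozen CLIP text transformer in place of ordinary token embeddings.
        return torch.cat([prefix, self.class_name_embs, suffix], dim=1)
```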
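A minimal sketch of the prompt coupling described for MaPLe: the text-side prompt vectors are learned directly, and the image-side prompts are obtained from them through a linear layer. Dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class CoupledPrompts(nn.Module):
    """Learn text prompts; derive image prompts from them with a linear map."""

    def __init__(self, n_prompts: int, text_dim: int, image_dim: int):
        super().__init__()
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, text_dim) * 0.02)
        # Coupling function: text prompt -> image prompt.
        self.to_image = nn.Linear(text_dim, image_dim)

    def forward(self):
        # Text prompts go to the text branch, their projections to the image branch.
        image_prompts = self.to_image(self.text_prompts)
        return self.text_prompts, image_prompts
```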
Title | Description | Code |
---|---|---|
Fine-Tuned CLIP Models Are Efficient Video Learners | Adapts CLIP for videos. Claims that frame-level CLIP embeddings, even though the frames are processed independently, can still capture temporal dependencies, and that instead of devising specific modules for temporal modelling, simply fine-tuning CLIP on videos (ViFi-CLIP) generalises to good performance. Uses temporal pooling: the embeddings of the T frames are pooled into a single video embedding used in the contrastive learning objective, which is probably why the embeddings stay consistent with image-based CLIP (a pooling sketch follows this table) | code |
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting | Performs prompt learning on video data to better fine-tune an image-based CLIP model for videos. Same authors as ViFi-CLIP (above). Need to look into how the prompts are actually learned | code |
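A minimal sketch of the temporal pooling described for ViFi-CLIP: each of the T frames is encoded independently by the CLIP image encoder and the frame embeddings are mean-pooled into one video embedding for the contrastive objective. `image_encoder` is a stand-in for CLIP's visual backbone.

```python
import torch

def encode_video(image_encoder, frames: torch.Tensor) -> torch.Tensor:
    """Encode a batch of videos by pooling frame-level CLIP embeddings.

    frames: (batch, T, C, H, W) video clips.
    Returns: (batch, dim) video embeddings (mean over the T frames).
    """
    batch, t, c, h, w = frames.shape
    # Frames are processed independently by the image encoder...
    frame_embs = image_encoder(frames.reshape(batch * t, c, h, w))
    frame_embs = frame_embs.reshape(batch, t, -1)
    # ...then temporally pooled into a single embedding per video.
    return frame_embs.mean(dim=1)
```

Because no new temporal modules are introduced, the pooled embeddings live in the same space as the original image-based CLIP embeddings.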
Title | Description | Code |
---|---|---|
CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model | Crowd counting with CLIP. Fine-tunes CLIP for the counting task using a ranking loss, without using people-count labels as ground truth. Uses a sequential prompting setup to select the image patches that contain only people's heads and performs counting on those (a ranking-loss sketch follows this table) | code |
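A rough sketch of a ranking loss in the spirit of CrowdCLIP's label-free training, assuming nested crops of the same image (a larger crop contains at least as many people as a smaller one) and text prompts describing increasing counts; this is an illustrative construction, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def ranking_loss(nested_crop_embs: torch.Tensor,
                 count_prompt_embs: torch.Tensor,
                 margin: float = 0.0) -> torch.Tensor:
    """Pairwise margin ranking loss over nested crops of one image.

    nested_crop_embs: (K, dim) embeddings of K nested crops, ordered from
        smallest to largest, so their true counts are non-decreasing.
    count_prompt_embs: (M, dim) embeddings of prompts describing increasing
        crowd counts. A crop's "score" is its expected count rank under a
        softmax over prompt similarities; a larger crop should not score
        lower than a smaller one.
    """
    crops = F.normalize(nested_crop_embs, dim=-1)
    prompts = F.normalize(count_prompt_embs, dim=-1)
    sims = crops @ prompts.t()                       # (K, M) similarities
    ranks = torch.arange(prompts.size(0), dtype=sims.dtype, device=sims.device)
    scores = (sims.softmax(dim=-1) * ranks).sum(-1)  # (K,) soft count rank

    # Hinge penalty whenever a smaller crop out-scores a larger one.
    diffs = scores[:-1] - scores[1:] + margin
    return F.relu(diffs).mean()
```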
Title | Description | Code |
---|---|---|
AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning | Uses prompt tuning with CLIP to tackle continual learning; heavily inspired by CoOp | code |
Title | Description | Code |
---|---|---|
CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data | ... | ... |