- CLIP in point-cloud/3D.
- Open-Vocabulary Object Detection (OVD)
- efficient CLIP training (better use of computation or data)
- applying CLIP models to narrow fields, such as human-object interaction detection, crowd counting, etc.
Papers from CVPR2023:
(some papers may be missing)
Title | Description | Code |
---|---|---|
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training | Reduces memory consumption in distributed training by decomposing the gradient of the contrastive loss | code |
Scaling Language-Image Pre-training via Masking | Randomly masks image patches in the image branch of CLIP during pre-training, improving training speed, memory usage, and performance (a masking sketch follows this table) | code |
Non-Contrastive Learning Meets Language-Image Pre-Training | Adds the non-contrastive loss from SwAV (based on cluster-assignment agreement) on top of CLIP's contrastive loss. Interestingly, the non-contrastive loss alone gives poor zero-shot performance, but the weighted combination (0.7 SwAV + 0.3 contrastive) outperforms the contrastive loss alone. It also reduces the amount of data needed (trained on only 35 million pairs) and the batch size (4096 compared to 32K); a loss-weighting sketch follows this table | code |
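A minimal sketch of the patch-masking idea from "Scaling Language-Image Pre-training via Masking", assuming a ViT-style image encoder that consumes a sequence of patch tokens; the function name and the `mask_ratio` default are illustrative, not the paper's exact recipe.

```python
import torch

def random_mask_patches(patch_tokens: torch.Tensor, mask_ratio: float = 0.5):
    """Randomly drop a fraction of patch tokens before the image encoder.

    patch_tokens: (batch, num_patches, dim) patch embeddings.
    Returns the kept tokens (batch, num_kept, dim) and their indices.
    """
    batch, num_patches, dim = patch_tokens.shape
    num_kept = int(num_patches * (1.0 - mask_ratio))

    # Sample a random permutation per image and keep the first num_kept patches.
    noise = torch.rand(batch, num_patches, device=patch_tokens.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_kept]

    kept = torch.gather(
        patch_tokens, dim=1,
        index=ids_keep.unsqueeze(-1).expand(-1, -1, dim),
    )
    return kept, ids_keep
```

The text branch is untouched; the speed and memory savings come purely from the image encoder seeing fewer tokens, which in turn allows larger batches.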
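A minimal sketch of the loss weighting described for "Non-Contrastive Learning Meets Language-Image Pre-Training" (0.7 non-contrastive + 0.3 contrastive); the SwAV-style cluster-assignment term is left as a placeholder callable, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Standard CLIP InfoNCE loss over a batch of paired embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def combined_loss(img_emb, txt_emb, swav_loss_fn, w_noncon=0.7, w_con=0.3):
    """Weighted sum of a SwAV-style non-contrastive loss and the CLIP loss.

    swav_loss_fn is a placeholder for the cluster-assignment objective;
    the 0.7 / 0.3 weights follow the description in the table above.
    """
    return (w_noncon * swav_loss_fn(img_emb, txt_emb) +
            w_con * clip_contrastive_loss(img_emb, txt_emb))
```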
Title | Description | Code |
---|---|---|
Learning to Name Classes for Vision and Language Models | Creates learnable token embeddings for the class names in an otherwise frozen CLIP model, reducing the need for prompt engineering (a class-name sketch follows this table) | NA |
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models | Shows that when fine-tuning the model with a linear classifier, it helps to train the classifier on examples from both modalities rather than images alone | NA |
MaPLe: Multi-modal Prompt Learning | Learns prompts on both the image and text branches; the image prompts are derived from a linear layer that takes the text prompts as input (a coupled-prompt sketch follows this table) | code |
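A minimal sketch of the learnable class-name idea from "Learning to Name Classes for Vision and Language Models": each class gets trainable word embeddings that replace its tokenized name inside a fixed prompt template, while the rest of CLIP stays frozen. The module, shapes, and prompt handling here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LearnableClassNames(nn.Module):
    """Trainable per-class word embeddings spliced into a frozen prompt.

    prompt_prefix_embs / prompt_suffix_embs are the frozen token embeddings
    of e.g. "a photo of a" and ".", precomputed with CLIP's token embedder.
    """

    def __init__(self, n_classes: int, n_name_tokens: int, dim: int,
                 prompt_prefix_embs: torch.Tensor,
                 prompt_suffix_embs: torch.Tensor):
        super().__init__()
        # One learnable embedding sequence per class, replacing its name tokens.
        self.class_name_embs = nn.Parameter(
            torch.randn(n_classes, n_name_tokens, dim) * 0.02)
        self.register_buffer("prefix", prompt_prefix_embs)  # (n_prefix, dim)
        self.register_buffer("suffix", prompt_suffix_embs)  # (n_suffix, dim)

    def forward(self) -> torch.Tensor:
        n_classes = self.class_name_embs.size(0)
        prefix = self.prefix.unsqueeze(0).expand(n_classes, -1, -1)
        suffix = self.suffix.unsqueeze(0).expand(n_classes, -1, -1)
        # (n_classes, n_prefix + n_name_tokens + n_suffix, dim), fed into the
        # frozen CLIP text transformer in place of ordinary token embeddings.
        return torch.cat([prefix, self.class_name_embs, suffix], dim=1)
```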
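A minimal sketch of the prompt coupling described for MaPLe: the text-side prompt vectors are learned directly, and the image-side prompts are obtained from them through a linear layer. Dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class CoupledPrompts(nn.Module):
    """Learn text prompts; derive image prompts from them with a linear map."""

    def __init__(self, n_prompts: int, text_dim: int, image_dim: int):
        super().__init__()
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, text_dim) * 0.02)
        # Coupling function: text prompt -> image prompt.
        self.to_image = nn.Linear(text_dim, image_dim)

    def forward(self):
        # Text prompts go to the text branch, their projections to the image branch.
        image_prompts = self.to_image(self.text_prompts)
        return self.text_prompts, image_prompts
```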
Title | Description | Code |
---|---|---|
Fine-Tuned CLIP Models Are Efficient Video Learners | Adapts CLIP for videos. Claims that frame-level CLIP embeddings, even though the frames are processed independently, can still capture temporal dependencies, and that instead of devising specific modules for temporal modelling, simply fine-tuning CLIP on videos (ViFi-CLIP) generalises to good performance. Uses temporal pooling: the embeddings of the T frames are pooled into a single video embedding used in the contrastive learning objective, which is probably why the embeddings stay consistent with image-based CLIP (a pooling sketch follows this table) | code |
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting | Performs prompt learning on video data to better fine-tune an image-based CLIP model for videos. Same authors as ViFi-CLIP (above). Need to look into how the prompts are actually learned | code |
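A minimal sketch of the temporal pooling described for ViFi-CLIP: each of the T frames is encoded independently by the CLIP image encoder and the frame embeddings are mean-pooled into one video embedding for the contrastive objective. `image_encoder` is a stand-in for CLIP's visual backbone.

```python
import torch

def encode_video(image_encoder, frames: torch.Tensor) -> torch.Tensor:
    """Encode a batch of videos by pooling frame-level CLIP embeddings.

    frames: (batch, T, C, H, W) video clips.
    Returns: (batch, dim) video embeddings (mean over the T frames).
    """
    batch, t, c, h, w = frames.shape
    # Frames are processed independently by the image encoder...
    frame_embs = image_encoder(frames.reshape(batch * t, c, h, w))
    frame_embs = frame_embs.reshape(batch, t, -1)
    # ...then temporally pooled into a single embedding per video.
    return frame_embs.mean(dim=1)
```

Because no new temporal modules are introduced, the pooled embeddings live in the same space as the original image-based CLIP embeddings.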
Title | Description | Code |
---|---|---|
CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model | Crowd counting with CLIP. Fine-tunes CLIP for the counting task using a ranking loss, without using people-count labels as ground truth. Uses a sequential prompting setup to select the image patches that contain only people's heads and performs counting on those (a ranking-loss sketch follows this table) | code |
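A rough sketch of a ranking loss in the spirit of CrowdCLIP's label-free training, assuming nested crops of the same image (a larger crop contains at least as many people as a smaller one) and text prompts describing increasing counts; this is an illustrative construction, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def ranking_loss(nested_crop_embs: torch.Tensor,
                 count_prompt_embs: torch.Tensor,
                 margin: float = 0.0) -> torch.Tensor:
    """Pairwise margin ranking loss over nested crops of one image.

    nested_crop_embs: (K, dim) embeddings of K nested crops, ordered from
        smallest to largest, so their true counts are non-decreasing.
    count_prompt_embs: (M, dim) embeddings of prompts describing increasing
        crowd counts. A crop's "score" is its expected count rank under a
        softmax over prompt similarities; a larger crop should not score
        lower than a smaller one.
    """
    crops = F.normalize(nested_crop_embs, dim=-1)
    prompts = F.normalize(count_prompt_embs, dim=-1)
    sims = crops @ prompts.t()                       # (K, M) similarities
    ranks = torch.arange(prompts.size(0), dtype=sims.dtype, device=sims.device)
    scores = (sims.softmax(dim=-1) * ranks).sum(-1)  # (K,) soft count rank

    # Hinge penalty whenever a smaller crop out-scores a larger one.
    diffs = scores[:-1] - scores[1:] + margin
    return F.relu(diffs).mean()
```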
Title | Description | Code |
---|---|---|
AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning | Uses prompt tuning with CLIP to tackle continual learning; heavily inspired by CoOp | code |
Title | Description | Code |
---|---|---|
CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data | ... | ... |