This is a simple implementation of natural language-based image search, written in PyTorch Lightning and inspired by CLIP, as proposed in the paper Learning Transferable Visual Models From Natural Language Supervision by OpenAI. We also use Weights & Biases for experiment tracking, result visualization, comparison of different backbone models, hyperparameter optimization, and reproducibility.
```bash
python examples/train_clip.py
```
This command initializes a CLIP model with a ResNet50 image backbone and a `distilbert-base-uncased` text backbone.
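As a rough illustration of how training could be wired to Weights & Biases through PyTorch Lightning, here is a minimal sketch; the project name and trainer settings are assumptions for illustration, not the script's actual configuration.

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Hypothetical wiring: project name and trainer arguments are illustrative.
wandb_logger = WandbLogger(project="clip-image-search")
trainer = pl.Trainer(logger=wandb_logger, max_epochs=10, accelerator="auto")
# trainer.fit(model, datamodule)  # model/datamodule come from the repo's training code
```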
CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in a dataset. This behavior turns CLIP into a zero-shot classifier: all of a dataset's classes are converted into captions such as "a photo of a dog", and CLIP predicts the class whose caption it estimates best pairs with a given image.
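For intuition, here is a minimal sketch of the symmetric contrastive objective CLIP trains with, written in plain PyTorch; the function name and temperature value are illustrative rather than taken from this repository.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity between every image and every text in the batch.
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image
    # and the right image for each text.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```

Each image is effectively classified against every caption in the batch (and vice versa), which is what makes zero-shot classification with caption templates possible at inference time.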
You can read more about CLIP here and here.
This implementation of CLIP supports training on two datasets: Flickr8k, which contains ~8K images with five captions per image, and Flickr30k, which contains ~30K images with corresponding captions.
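As an illustration only, such a caption dataset could be exposed to PyTorch as one (image, caption) pair per row; the `captions.csv` layout and column names below are hypothetical and may differ from the loaders in this repository.

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class FlickrCaptionDataset(Dataset):
    """Hypothetical image-caption dataset: one (image, caption) pair per row."""

    def __init__(self, captions_csv, image_dir, transform=None):
        # Assumes a CSV with "image" and "caption" columns; adjust to the real layout.
        self.df = pd.read_csv(captions_csv)
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['image']}").convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, row["caption"]
```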
A CLIP model uses a text encoder and an image encoder. This repository supports pulling image models from PyTorch Image Models (timm) and transformer models from Hugging Face Transformers.
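Below is a sketch of what such a dual-encoder setup could look like, assuming timm for the image backbone and Hugging Face Transformers for the text backbone; the class names and projection dimension are illustrative, not this repository's exact modules.

```python
import timm
import torch.nn as nn
from transformers import AutoModel

class ImageEncoder(nn.Module):
    """Wraps a timm backbone and projects its pooled features into a shared space."""

    def __init__(self, name="resnet50", embed_dim=512):
        super().__init__()
        # num_classes=0 drops the classification head and returns pooled features.
        self.backbone = timm.create_model(name, pretrained=True, num_classes=0)
        self.proj = nn.Linear(self.backbone.num_features, embed_dim)

    def forward(self, images):
        return self.proj(self.backbone(images))

class TextEncoder(nn.Module):
    """Wraps a Hugging Face transformer and projects its first-token output."""

    def __init__(self, name="distilbert-base-uncased", embed_dim=512):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.proj = nn.Linear(self.backbone.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state[:, 0])
```

Any timm model name (e.g. `vit_base_patch16_224`) or Hugging Face checkpoint (e.g. `bert-base-uncased`) could be dropped into this pattern without changing the rest of the pipeline.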