This is a simple implementation of natural language-based image search, written in PyTorch Lightning and inspired by CLIP, as proposed in the paper Learning Transferable Visual Models From Natural Language Supervision by OpenAI. We also use Weights & Biases for experiment tracking, result visualization, comparison of different backbone models, hyperparameter optimization, and reproducibility.
```bash
python examples/train_clip.py
```
This command initializes a CLIP model with a ResNet50 image backbone and a `distilbert-base-uncased` text backbone.
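As a rough illustration of how training could be wired to Weights & Biases through PyTorch Lightning, here is a minimal sketch; the project name and trainer settings are assumptions for illustration, not the script's actual configuration.

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Hypothetical wiring: project name and trainer arguments are illustrative.
wandb_logger = WandbLogger(project="clip-image-search")
trainer = pl.Trainer(logger=wandb_logger, max_epochs=10, accelerator="auto")
# trainer.fit(model, datamodule)  # model/datamodule come from the repo's training code
```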
CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in a dataset. This behavior turns CLIP into a zero-shot classifier: all of a dataset's classes are converted into captions such as "a photo of a dog", and CLIP predicts the class whose caption it estimates best pairs with a given image.
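For intuition, here is a minimal sketch of the symmetric contrastive objective CLIP trains with, written in plain PyTorch; the function name and temperature value are illustrative rather than taken from this repository.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity between every image and every text in the batch.
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image
    # and the right image for each text.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```

Each image is effectively classified against every caption in the batch (and vice versa), which is what makes zero-shot classification with caption templates possible at inference time.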
You can read more about CLIP here and here.
This implementation of CLIP supports training on two datasets: Flickr8k, which contains ~8K images with five captions per image, and Flickr30k, which contains ~30K images with corresponding captions.
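As an illustration only, such a caption dataset could be exposed to PyTorch as one (image, caption) pair per row; the `captions.csv` layout and column names below are hypothetical and may differ from the loaders in this repository.

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class FlickrCaptionDataset(Dataset):
    """Hypothetical image-caption dataset: one (image, caption) pair per row."""

    def __init__(self, captions_csv, image_dir, transform=None):
        # Assumes a CSV with "image" and "caption" columns; adjust to the real layout.
        self.df = pd.read_csv(captions_csv)
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['image']}").convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, row["caption"]
```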
A CLIP model uses a text encoder and an image encoder. This repository supports pulling image models from PyTorch Image Models (timm) and transformer models from Hugging Face Transformers.
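Below is a sketch of what such a dual-encoder setup could look like, assuming timm for the image backbone and Hugging Face Transformers for the text backbone; the class names and projection dimension are illustrative, not this repository's exact modules.

```python
import timm
import torch.nn as nn
from transformers import AutoModel

class ImageEncoder(nn.Module):
    """Wraps a timm backbone and projects its pooled features into a shared space."""

    def __init__(self, name="resnet50", embed_dim=512):
        super().__init__()
        # num_classes=0 drops the classification head and returns pooled features.
        self.backbone = timm.create_model(name, pretrained=True, num_classes=0)
        self.proj = nn.Linear(self.backbone.num_features, embed_dim)

    def forward(self, images):
        return self.proj(self.backbone(images))

class TextEncoder(nn.Module):
    """Wraps a Hugging Face transformer and projects its first-token output."""

    def __init__(self, name="distilbert-base-uncased", embed_dim=512):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.proj = nn.Linear(self.backbone.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state[:, 0])
```

Any timm model name (e.g. `vit_base_patch16_224`) or Hugging Face checkpoint (e.g. `bert-base-uncased`) could be dropped into this pattern without changing the rest of the pipeline.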