Implementation of a CLIP model with reduced capacity. For self-educational purposes only.
This repo currently contains only a CLIP-ResNet implementation, while the original paper describes 5 ResNet and 3 ViT models. There was no intention to beat SotA or train a superior version of CLIP; this is just an attempt to understand the logic behind CLIP.
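At its core, CLIP jointly trains an image encoder and a text encoder with a symmetric contrastive loss over a batch of image-text pairs: matching pairs should get high cosine similarity, every other combination low. Below is a minimal PyTorch sketch of that objective. It is not the code used in this repo, and unlike the original paper, which learns the temperature, the temperature here is a fixed constant for brevity.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over the NxN image-text similarity matrix."""
    # L2-normalize the embeddings so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (N, N) logits: entry (i, j) is the similarity of image i and text j.
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img_to_txt = F.cross_entropy(logits, targets)
    loss_txt_to_img = F.cross_entropy(logits.t(), targets)
    return (loss_img_to_txt + loss_txt_to_img) / 2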
After training CLIP-ResNet50 for 10 epochs, the following results were obtained.
As can be seen, the results are not great, but the model is definitely trying to stick closer to the correct pairs.
To run the training, you should first download the COCO dataset and provide paths to annotations and images for both the train and val splits in a config (check the example here).
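The snippet below only illustrates the kind of fields such a config needs; all key names here are hypothetical, so follow the real example in configs/clip_base.yaml for the actual schema.

# Hypothetical layout, shown with standard COCO 2017 file names.
dataset:
  train:
    annotations: /data/coco/annotations/captions_train2017.json
    images: /data/coco/train2017
  val:
    annotations: /data/coco/annotations/captions_val2017.json
    images: /data/coco/val2017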
After that, run:
python tools/train.py --path_to_config=configs/clip_base.yaml --path_to_log=logs/
This will create a directory structure under the logs/ directory separately for each run (aka experiment directories):
logs/
|--{experiment_name}/
|  |--artifacts/
|  |--checkpoints/
|  |--train.log
|  |--{experiment_name}.yaml
Under the logs/{experiment_name}/artifacts/ directory, a training_progress.log will be saved, containing the train and validation losses.
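If you want to plot those curves, a short script along these lines can do it once the parsing is adapted to the actual contents of training_progress.log; the line format assumed below (epoch=... train_loss=... val_loss=...) is a guess, not the file's real layout.

# Sketch only: adjust the path and the regex to the real log format.
import re
import matplotlib.pyplot as plt

epochs, train_losses, val_losses = [], [], []
with open("logs/{experiment_name}/artifacts/training_progress.log") as f:  # replace {experiment_name}
    for line in f:
        match = re.search(r"epoch=(\d+).*train_loss=([\d.]+).*val_loss=([\d.]+)", line)
        if match:
            epochs.append(int(match.group(1)))
            train_losses.append(float(match.group(2)))
            val_losses.append(float(match.group(3)))

plt.plot(epochs, train_losses, label="train")
plt.plot(epochs, val_losses, label="val")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curves.png")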
Each training run generates an overridden config and saves it under the logs/{experiment_name}/ directory.
To plot similarity matrices on the validation dataset, run:
python tools/plot_similarities.py --path_to_config=logs/{experiment_name}/{experiment_name}.yaml \
--path_to_ckpt=logs/{experiment_name}/checkpoints/some_ckpt.pth \
--n_pairs=8 \
--n_matricies=5
Here, n_matricies denotes the number of similarity matrices to create, and n_pairs denotes the number of image-text pairs to include in each similarity matrix.
All the similarity matrices will be saved under logs/{experiment_name}/artifacts/.
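For intuition, each similarity matrix is just the pairwise cosine similarity between the image and text embeddings of n_pairs pairs, and a well-trained model puts the largest values on the diagonal. Below is a minimal sketch of how such a matrix could be computed and drawn with matplotlib; it is not the repo's plot_similarities.py, and the random embeddings in the usage line are placeholders for real encoder outputs.

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_similarity_matrix(image_embeds: torch.Tensor,
                           text_embeds: torch.Tensor,
                           out_path: str = "similarity_matrix.png") -> None:
    """Draw the (n_pairs x n_pairs) cosine-similarity matrix of paired embeddings."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    sims = (image_embeds @ text_embeds.t()).cpu().numpy()

    fig, ax = plt.subplots()
    im = ax.imshow(sims, vmin=-1.0, vmax=1.0)
    ax.set_xlabel("text index")
    ax.set_ylabel("image index")
    fig.colorbar(im, ax=ax)
    fig.savefig(out_path)
    plt.close(fig)

# Usage with placeholder embeddings (replace with real encoder outputs):
plot_similarity_matrix(torch.randn(8, 512), torch.randn(8, 512))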