The code for applying CLIM to CLIP models is adapted from OpenCLIP-v2.16.0. Run the following commands to install the package:

```bash
cd CLIM/
pip install -e . -v
```
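If the installation succeeded, the package should be importable. A quick sanity check (a minimal sketch, not part of the repository):

```python
# Minimal check that the editable OpenCLIP install is importable.
import open_clip

print(open_clip.__version__)            # expected: 2.16.0
print(open_clip.list_pretrained()[:3])  # a few (architecture, pretrained-tag) pairs
```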
The main experiments are conducted using images from COCO and CC3M. Please prepare the datasets and organize them as follows:
```
CLIM/
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── panoptic_val2017.json
│   │   │   ├── panoptic_val2017     # panoptic masks
│   │   ├── wusize
│   │   │   ├── captions_train2017_tags_allcaps.json
│   │   ├── train2017
│   │   ├── val2017
│   ├── cc3m
│   │   ├── cc3m_captions_train.json
│   │   ├── train
```
The json file `captions_train2017_tags_allcaps.json` for the COCO captions can be obtained from Google Drive. For the CC3M dataset, please download the images using the csv file from the official website, and then generate the json file following the COCO format (a conversion sketch is given after the example below). The json file `cc3m_captions_train.json` might look like:
```
{'images':
    [
        {'id': 1, 'file_name': 'train/0/0.jpg', 'captions': ['a very typical bus station']},
        {'id': 4, 'file_name': 'train/3/3.jpg', 'captions': ['interior design of modern living room with fireplace in a new house']},
    ]
}
```
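The repository does not prescribe a particular conversion script, so below is a minimal sketch of how such a file could be produced. The CSV filename `cc3m_downloaded.csv` and its columns (`file_name`, `caption`) are assumptions about how the downloaded images are indexed, not part of the repository.

```python
# Hedged sketch (not the authors' script) for building cc3m_captions_train.json
# in the COCO-style format shown above. It assumes the images already sit under
# data/cc3m/train/ and that a CSV with hypothetical name/columns maps each
# relative image path to its caption.
import csv
import json

images = []
with open('data/cc3m/cc3m_downloaded.csv') as f:
    reader = csv.DictReader(f)  # assumed columns: file_name, caption
    for idx, row in enumerate(reader):
        images.append({
            'id': idx,
            'file_name': row['file_name'],   # e.g. 'train/0/0.jpg'
            'captions': [row['caption']],
        })

with open('data/cc3m/cc3m_captions_train.json', 'w') as f:
    json.dump({'images': images}, f)
```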
To run CLIM, first obtain the original CLIP models using these links and put them under `checkpoints/` like the following:
```
CLIM/
├── checkpoints
│   ├── ViT-B-16.pt
│   ├── RN50x64.pt
```
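Alternatively, if the official `clip` package from the openai/CLIP repository is installed, the two checkpoints can be fetched programmatically; this is just one convenient option and produces the filenames expected above.

```python
# Optional sketch: download the original OpenAI weights into checkpoints/ using
# the `clip` package (pip install git+https://github.com/openai/CLIP.git).
# The files are saved as checkpoints/ViT-B-16.pt and checkpoints/RN50x64.pt.
import clip

clip.load('ViT-B/16', device='cpu', download_root='checkpoints/')
clip.load('RN50x64', device='cpu', download_root='checkpoints/')
```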
We provide scripts to run CLIM. For example, to refine ViT-B/16 on the COCO dataset, simply run:

```bash
bash scripts/train_clim_coco_100e_openai_vitb16.sh
```
We also provide the checkpoints of the models trained by CLIM on Google Drive.
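Assuming a released checkpoint is a standard OpenCLIP state dict (the models are trained with this OpenCLIP fork), it can be loaded for zero-shot inference roughly as sketched below; the checkpoint filename is a placeholder for the file downloaded from Google Drive.

```python
# Sketch: load a CLIM-refined ViT-B/16 checkpoint with OpenCLIP and run zero-shot
# classification on a dummy image. The checkpoint path is a placeholder.
import torch
import torch.nn.functional as F
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='checkpoints/clim_coco_vitb16.pt')
tokenizer = open_clip.get_tokenizer('ViT-B-16')
model.eval()

image = preprocess(Image.new('RGB', (224, 224))).unsqueeze(0)  # dummy image for illustration
text = tokenizer(['a photo of a cat', 'a photo of a dog'])

with torch.no_grad():
    image_features = F.normalize(model.encode_image(image), dim=-1)
    text_features = F.normalize(model.encode_text(text), dim=-1)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```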
To build open-vocabulary detectors using the models trained by CLIM, please refer to the instructions in this README.