Themo, named after the beloved Chilean cartoonist Themo Lobos, is a BERT-based CLIP text encoder trained in Spanish.
Multimodal learning has revolutionized many areas of deep learning, but most of these models are trained only in English, and thus only work in that language.
Our goal here is to take advantage of the knowledge already present in CLIP and fine-tune a language model pre-trained on Spanish so that it learns to translate into CLIP's shared latent space, following Multilingual-CLIP's approach.
Currently, we have only trained a small proof-of-concept version. We plan to train more versions once we have a robust Spanish-only multimodal dataset and access to more GPUs. 😊
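In broad strokes, the approach distills CLIP's frozen English text encoder into a Spanish student. Below is a minimal sketch of that objective, not the actual themo code: the BETO checkpoint, the mean pooling, and the ViT-B/32 teacher (512-d text space) are our assumptions for illustration.

```python
# Sketch of a Multilingual-CLIP-style distillation step (illustrative only).
# Assumptions: BETO as the Spanish student, naive mean pooling,
# ViT-B/32 as the frozen CLIP teacher. Not the actual themo API.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

import clip  # https://github.com/openai/CLIP

teacher, _ = clip.load("ViT-B/32", device="cpu")  # frozen English teacher
teacher.eval()

tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
student = AutoModel.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
project = torch.nn.Linear(student.config.hidden_size, 512)  # into CLIP's text space

def training_step(english_captions: list[str], spanish_captions: list[str]) -> torch.Tensor:
    # Teacher target: CLIP embeddings of the English captions.
    with torch.no_grad():
        target = teacher.encode_text(clip.tokenize(english_captions)).float()
    # Student prediction: pooled Spanish BERT features, projected.
    batch = tokenizer(spanish_captions, padding=True, truncation=True, return_tensors="pt")
    hidden = student(**batch).last_hidden_state.mean(dim=1)  # naive mean pooling
    return F.mse_loss(project(hidden), target)               # match the teacher
```

The image encoder is never touched, so the student's embeddings stay compatible with CLIP's image tower.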
To train your own version of Themo, simply run:
```sh
python -m themo train
```
Our best results were achieved with the following hyperparameters:
```sh
python -m themo train --batch-size 256 --learn-rate 8e-5
```
This run achieved a final training loss of 0.244 and the following evaluation scores:
|           | @01   | @05   | @10   |
|-----------|-------|-------|-------|
| Accuracy  | 0.366 | 0.586 | 0.649 |
| Retrieval | 0.481 | 0.752 | 0.850 |
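If you want to sanity-check numbers like these yourself, retrieval@K can be computed as in the following sketch, assuming caption `i` is paired with image `i`. This mirrors the usual recall@K definition; the exact metric lives behind `python -m themo test` and may differ in details.

```python
import torch

def retrieval_at_k(text_embs: torch.Tensor, image_embs: torch.Tensor, k: int) -> float:
    """Fraction of captions whose matching image (same index) is among
    the top-k images ranked by cosine similarity."""
    text_embs = torch.nn.functional.normalize(text_embs, dim=-1)
    image_embs = torch.nn.functional.normalize(image_embs, dim=-1)
    sims = text_embs @ image_embs.T              # (n_texts, n_images)
    topk = sims.topk(k, dim=-1).indices          # top-k image indices per caption
    targets = torch.arange(len(text_embs)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```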
To evaluate your trained model, run (something like):
```sh
python -m themo test --version-path logs/.../version_X
```
For the sake of comparison, here are the baseline results (taken from Multilingual-CLIP):
|           | @01   | @05   | @10   |
|-----------|-------|-------|-------|
| Accuracy  | 0.370 | 0.594 | 0.660 |
| Retrieval | 0.504 | 0.795 | 0.888 |
These can also be reproduced by running:

```sh
python -m themo test --baseline
```
Some of the data is tricky to obtain, and downloading the full datasets is largely redundant since we only use the test splits.
For convenience, here are instructions for downloading the data we use.
The captions come from the official XTD10 repository, and the implementation takes care of downloading them.
The images come from standard MSCOCO, but not all of them are used. To download the filtered version, run:

```sh
mkdir -p data/mscoco && wget -O- https://users.dcc.uchile.cl/~gchapero/datasets/coco_xtd10.tar.gz | tar -xz -C data/mscoco
```
You can use the full MSCOCO release instead, but it is a waste of disk space.
The data directories should look like this for the images to be located properly:
```
data/
...
├── mscoco
│   ├── train2014
│   │   ...
│   │   ├── COCO_train2014_000000436508.jpg
│   │   ├── COCO_train2014_000000436515.jpg
│   │   ...
│   └── val2014
│       ...
│       ├── COCO_val2014_000000127068.jpg
│       ├── COCO_val2014_000000127074.jpg
│       ...
...
```
The command above should leave things in this format. Any extra directories and files are ignored, so you can use the full MSCOCO if you want.
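As a quick, hypothetical sanity check (not part of the themo CLI), you can count the images per split with a few lines of Python; the paths come from the tree above:

```python
# Hypothetical sanity check: count images per MSCOCO split.
from pathlib import Path

for split in ("train2014", "val2014"):
    split_dir = Path("data/mscoco") / split
    n_images = len(list(split_dir.glob("*.jpg")))
    print(f"{split_dir}: {n_images} images")
```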
Same as with MSCOCO, you can use the full ImageNet in the data directory, but the training images are not needed. The following command downloads only the splits needed for this work:

```sh
mkdir -p data/imagenet && wget -O- https://users.dcc.uchile.cl/~gchapero/datasets/imagenet_object_localization_patched2019_val_test_only.tar.gz | tar -xzC data/imagenet
```
The data directory should end up looking like this, whether you use full ImageNet or our filtered version:
```
data/
├── imagenet
│   ├── ILSVRC
│   │   ├── Annotations
│   │   │   └── CLS-LOC
│   │   │       └── val
│   │   │           ├── ILSVRC2012_val_00000001.xml
│   │   │           ├── ILSVRC2012_val_00000002.xml
│   │   │           └── ...
│   │   └── Data
│   │       └── CLS-LOC
│   │           ├── test
│   │           │   ├── ILSVRC2012_test_00000001.JPEG
│   │           │   ├── ILSVRC2012_test_00000002.JPEG
│   │           │   └── ...
│   │           └── val
│   │               ├── ILSVRC2012_val_00000001.JPEG
│   │               ├── ILSVRC2012_val_00000002.JPEG
│   │               └── ...
│   ├── LOC_sample_submission.csv
│   ├── LOC_synset_mapping.txt
│   ├── LOC_train_solution.csv
│   └── LOC_val_solution.csv
└── ...
```
Any extra directories or files are ignored, so you can use the full ImageNet if you have it at hand.
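For reference, `LOC_synset_mapping.txt` maps each WordNet synset id to its human-readable class names, one class per line (e.g. `n01440764 tench, Tinca tinca`); these names are the kind of label list a zero-shot ImageNet evaluation builds its text prompts from. A minimal parsing sketch, assuming the standard Kaggle file format:

```python
# Parse LOC_synset_mapping.txt into {synset_id: [class names]}.
synset_to_names = {}
with open("data/imagenet/LOC_synset_mapping.txt") as f:
    for line in f:
        synset_id, names = line.strip().split(" ", 1)
        synset_to_names[synset_id] = [n.strip() for n in names.split(",")]

print(len(synset_to_names))          # 1000 classes
print(synset_to_names["n01440764"])  # ['tench', 'Tinca tinca']
```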