This repository contains the models described in the following paper:
Orhan AE (2023) Scaling may be all you need for achieving human-level object recognition capacity with human-like visual experience. arXiv:2308.03712.
The models are all hosted on Hugging Face. You will need the `huggingface_hub` library to download them. Model names are specified in the format `x_y_z`, where `x` is the model architecture, `y` is the fraction of the combined video dataset used for self-supervised pretraining, and `z` is the seed:
- `x` can be one of `vits14`, `vitb14`, `vitl14`, `vith14`, `vith14@448`, `vith14@476`
- `y` can be one of `1.0`, `0.1`, `0.01`, `0.001`, `0.0001`
- `z` can be one of `1`, `2`, `3`
Here, the model architectures (`x`) are:

- `vits14` = ViT-S/14
- `vitb14` = ViT-B/14
- `vitl14` = ViT-L/14
- `vith14` = ViT-H/14
- `vith14@448` = ViT-H/14@448 (trained with 448x448 images)
- `vith14@476` = ViT-H/14@476 (trained with 476x476 images)
and the data fractions (`y`) are:

- `1.0` = full training data (~5000 hours)
- `0.1` = 10% of the full data (~500 hours)
- `0.01` = 1% of the full data (~50 hours)
- `0.001` = 0.1% of the full data (~5 hours)
- `0.0001` = 0.01% of the full data (~0.5 hours)
When training on proper subsets of the full data (0.01%-10%), subset selection was repeated 3 times. The seed `z` corresponds to these three repeats. Note that for the full training data (100%), there is only one possible dataset, so `z` can only be `1` in this case.
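For concreteness, the following sketch enumerates every valid model identifier under this naming scheme; the helper function is just for illustration and is not part of the repository.

```python
# Enumerate all valid model identifiers of the form x_y_z.
ARCHS = ["vits14", "vitb14", "vitl14", "vith14", "vith14@448", "vith14@476"]
FRACTIONS = ["1.0", "0.1", "0.01", "0.001", "0.0001"]

def all_model_ids():
    """Yield every valid x_y_z identifier (the full-data condition has a single seed)."""
    for x in ARCHS:
        for y in FRACTIONS:
            seeds = ["1"] if y == "1.0" else ["1", "2", "3"]
            for z in seeds:
                yield f"{x}_{y}_{z}"

print(sum(1 for _ in all_model_ids()))  # 6 * (1 + 4 * 3) = 78 identifiers
```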
Loading a pretrained model is then as easy as:

```python
from utils import load_model

model = load_model('vith14@476_1.0_1')
```
where `'vith14@476_1.0_1'` is the model identifier. This will download the corresponding pretrained checkpoint, store it in the cache, build the right model architecture, and load the pretrained weights onto the model, all in one go!
When you load a pretrained model, you may get a warning message along the lines of `_IncompatibleKeys(missing_keys=[], unexpected_keys=...)`. This is normal: it happens because we do not load the decoder from the MAE pretraining stage; we are only interested in the encoder backbone.
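As a quick sanity check, you can push a dummy batch through the loaded encoder. The snippet below is a minimal sketch: it assumes the returned object is a standard PyTorch module whose forward pass maps an image batch to feature embeddings; the input resolution and the exact output shape are assumptions, not guarantees.

```python
import torch
from utils import load_model

model = load_model('vitb14_1.0_1')  # any valid model identifier works here
model.eval()

# Dummy batch of one 224x224 RGB image; for the ViT-*/14 backbones the side
# length should be divisible by the patch size (14), and 224 = 16 * 14.
# The @448 and @476 models expect correspondingly larger inputs.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = model(x)
print(features.shape)
```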
The above loads the pretrained model without any ImageNet finetuning. If you instead want to load an ImageNet-finetuned version of the model, just set `finetuned=True`:

```python
from utils import load_model

model = load_model('vith14@476_1.0_1', finetuned=True)
```
This will load the corresponding model finetuned with 2% of the ImageNet training data (the permissive finetuning condition in the paper). Unfortunately, I did not save the models finetuned on ~1% of ImageNet (the stringent finetuning condition in the paper), so the permissive models are the only ImageNet-finetuned models available for now. Please feel free to let me know if you are interested in other finetuning settings; it would not be very difficult for me to finetune the models under other desired settings.
In the finetuned models, we use a classifier head that consists of a batch normalization layer followed by a linear layer, i.e. `BatchNorm1d` + `Linear`.
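For reference, a minimal PyTorch sketch of such a head is shown below; the embedding width, number of classes, and BatchNorm settings are illustrative placeholders rather than the exact configuration stored in the checkpoints.

```python
import torch.nn as nn

embed_dim = 1280     # e.g. the ViT-H embedding width; use your encoder's actual width
num_classes = 1000   # ImageNet-1k

# Classifier head: 1D batch norm over the encoder features, then a linear layer.
head = nn.Sequential(
    nn.BatchNorm1d(embed_dim),
    nn.Linear(embed_dim, num_classes),
)
```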
I also wrote some minimal test code in `test_model.py` to check the accuracy of the finetuned models on the ImageNet validation set. You can use it as follows:

```bash
python -u test_model.py \
  --model_id 'vith14@476_1.0_1' \
  --input_size 476 \
  --val_data_path IMAGENET_VAL_DATA_PATH
```
This may take up to ~35 minutes on an A100 for the largest models. You should get a top-1 accuracy above 44% and a top-5 accuracy above 71% with the two largest and most performant models, i.e. `vith14@448_1.0_1` and `vith14@476_1.0_1`.
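If you would rather run the evaluation from your own code, the sketch below shows the general shape of such a top-1/top-5 evaluation loop over an ImageFolder-style validation directory. The preprocessing values, batch size, and loader settings here are assumptions for illustration and are not necessarily what `test_model.py` uses.

```python
import torch
from torchvision import datasets, transforms
from utils import load_model

model = load_model('vith14@476_1.0_1', finetuned=True).cuda().eval()

# Assumed preprocessing: resize + center crop to the training resolution,
# with standard ImageNet normalization.
preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(476),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
val_set = datasets.ImageFolder('IMAGENET_VAL_DATA_PATH', transform=preprocess)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=64, num_workers=8)

top1 = top5 = total = 0
with torch.no_grad():
    for images, targets in val_loader:
        logits = model(images.cuda())
        _, pred = logits.topk(5, dim=1)           # top-5 predicted class indices
        correct = pred.cpu().eq(targets.unsqueeze(1))
        top1 += correct[:, 0].sum().item()
        top5 += correct.any(dim=1).sum().item()
        total += targets.size(0)

print(f"top-1: {top1 / total:.4f}, top-5: {top5 / total:.4f}")
```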
The models were all pretrained with code from this repository, which is my personal copy of the excellent MAE repository from Meta AI. In particular, I used this SLURM batch script to train all models (this script contains all pretraining configuration details). Pretraining logs for all models can be found in the `logs/pretraining_logs` folder.
The models were finetuned with code from the same repository as the pretraining code. In particular, I used this SLURM batch script to finetune all models (this script contains all relevant finetuning configuration details). All finetuning logs for the permissive condition can be found in the `logs/finetuning_logs` folder.
One important point to note is that during finetuning, I did not use the standard heavy data augmentations and regularizers employed in MAE finetuning (e.g. cutmix and mixup). I instead used very minimal data augmentations (just random resized crops and horizontal flips; see here). This is to make sure the finetuning data remain as "human-like" as possible. In my experience, it is possible to get results that are a few percentage points better (in absolute terms) with the more standard heavy augmentation and regularization pipeline used for MAEs. There is a separate branch of my MAE repository that implements this more standard finetuning pipeline; you can use that branch if you would like to finetune the pretrained models in a more standard way.
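For illustration, a minimal augmentation pipeline of this kind might look as follows in torchvision; the crop resolution, scale, and normalization constants here are assumptions, and the linked finetuning code remains the authoritative reference.

```python
from torchvision import transforms

# Minimal "human-like" finetuning augmentations: random resized crop + horizontal flip.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(476),   # match the model's training resolution
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```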