Vision transformers trained with human-like video data

This repository contains the models described in the following paper:

Orhan AE (2023) Scaling may be all you need for achieving human-level object recognition capacity with human-like visual experience. arXiv:2308.03712.

Loading the models

The models are all hosted on Hugging Face. You will need the huggingface_hub library to download them. Model names are specified in the format x_y_z, where x is the model architecture, y is the fraction of the combined video dataset used for self-supervised pretraining, and z is the seed:

  • x can be one of vits14, vitb14, vitl14, vith14, vith14@448, vith14@476
  • y can be one of 1.0, 0.1, 0.01, 0.001, 0.0001
  • z can be one of 1, 2, 3

Here, the model architectures (x) are:

  • vits14 = ViT-S/14
  • vitb14 = ViT-B/14
  • vitl14 = ViT-L/14
  • vith14 = ViT-H/14
  • vith14@448 = ViT-H/14@448 (trained with 448x448 images)
  • vith14@476 = ViT-H/14@476 (trained with 476x476 images)

and the data fractions (y) are:

  • 1.0 = full training data (~5000 hours)
  • 0.1 = 10% of the full data (~500 hours)
  • 0.01 = 1% of the full data (~50 hours)
  • 0.001 = 0.1% of the full data (~5 hours)
  • 0.0001 = 0.01% of the full data (~0.5 hours)

When training on proper subsets of the full data (0.01%-10%), subset selection was repeated three times, and the seed z indexes these three repeats. For the full training data (100%), there is only one possible dataset, so z can only be 1 in that case.
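
Putting this together, the set of valid model identifiers can be enumerated with a few lines of Python. This is only an illustration of the naming scheme above; it assumes every architecture and data-fraction combination was trained, which I have not verified against the hosted checkpoints.

# enumerate model identifiers of the form x_y_z following the naming scheme above
archs = ['vits14', 'vitb14', 'vitl14', 'vith14', 'vith14@448', 'vith14@476']
fractions = ['1.0', '0.1', '0.01', '0.001', '0.0001']
seeds = ['1', '2', '3']

model_ids = []
for x in archs:
    for y in fractions:
        # only seed 1 exists for the full training data (y = 1.0)
        for z in (['1'] if y == '1.0' else seeds):
            model_ids.append(f'{x}_{y}_{z}')

print(len(model_ids))  # 6 architectures x (1 + 4 x 3) data/seed combinations = 78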

Loading a pretrained model

Loading a pretrained model is then as easy as:

from utils import load_model

model = load_model('vith14@476_1.0_1')

where 'vith14@476_1.0_1' is the model identifier. This will download the corresponding pretrained checkpoint, store it in cache, build the right model architecture, and load the pretrained weights onto the model, all in one go!

When you load a pretrained model, you may get a warning message along the lines of _IncompatibleKeys(missing_keys=[], unexpected_keys=...). This is expected: the decoder from the MAE pretraining stage is not loaded, since we are only interested in the encoder backbone.
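
As a quick sanity check, you can run the loaded encoder on a dummy image batch. The sketch below assumes the returned object is a standard PyTorch module that accepts a batch of RGB images at the model's training resolution; the shape and meaning of the output (e.g. pooled features vs. patch tokens) depend on the architecture, so inspect it for your use case.

import torch
from utils import load_model

model = load_model('vith14@476_1.0_1')
model.eval()

# dummy batch of two 476x476 RGB images (the resolution this model was trained at)
x = torch.randn(2, 3, 476, 476)

with torch.no_grad():
    features = model(x)

print(features.shape)  # inspect the encoder output shape here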

Loading a finetuned model

The above loads the self-supervised pretrained model without any ImageNet finetuning. If you instead want to load an ImageNet-finetuned version of the model, simply set finetuned=True:

from utils import load_model

model = load_model('vith14@476_1.0_1', finetuned=True)

This will load the corresponding model finetuned with 2% of the ImageNet training data (the permissive finetuning condition in the paper). Unfortunately, I have not saved the models finetuned on ~1% of ImageNet (the stringent finetuning condition in the paper), so these are the only ImageNet-finetuned models available for now. Please feel free to let me know if you are interested in other finetuning settings; it would not be very difficult for me to finetune the models under other desired configurations.

In the finetuned models, we use a classifier head that consists of a batch normalization layer followed by a linear layer: i.e. BatchNorm1d + Linear.
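
For concreteness, such a head could be built as in the sketch below. The embedding dimension shown is the standard ViT-H value and the BatchNorm1d settings are left at their defaults; these are assumptions for illustration only, since the loaded finetuned models already contain the actual head.

import torch.nn as nn

embed_dim = 1280    # assumed encoder embedding dimension (standard for ViT-H); check the loaded model
num_classes = 1000  # ImageNet-1k

# classifier head as described above: batch norm followed by a linear layer
head = nn.Sequential(
    nn.BatchNorm1d(embed_dim),
    nn.Linear(embed_dim, num_classes),
)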

Testing the finetuned models

I also wrote some minimal test code in test_model.py to check the accuracy of the finetuned models on the ImageNet validation set. You can use it as follows:

python -u test_model.py \
        --model_id 'vith14@476_1.0_1' \
        --input_size 476 \
        --val_data_path IMAGENET_VAL_DATA_PATH

This may take up to ~35 minutes on an A100 for the largest models. You should get a top-1 accuracy above 44% and a top-5 accuracy above 71% with the two largest and most performant models, i.e. 'vith14@448_1.0_1' and 'vith14@476_1.0_1'.
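
For reference, a minimal evaluation loop along the same lines as test_model.py might look as follows. This is a hedged sketch rather than the actual test script: the preprocessing (resize/crop sizes and normalization statistics) is an assumption here and should be matched to what test_model.py uses for a given --input_size.

import torch
from torchvision import datasets, transforms
from utils import load_model

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = load_model('vith14@476_1.0_1', finetuned=True).to(device).eval()

# assumed preprocessing; the actual settings live in test_model.py
tfm = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(476),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

val_set = datasets.ImageFolder('IMAGENET_VAL_DATA_PATH', transform=tfm)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=64, num_workers=8)

top1 = top5 = total = 0
with torch.no_grad():
    for images, targets in val_loader:
        logits = model(images.to(device))
        top5_preds = logits.topk(5, dim=-1).indices.cpu()
        top1 += (top5_preds[:, 0] == targets).sum().item()
        top5 += (top5_preds == targets.unsqueeze(1)).any(dim=1).sum().item()
        total += targets.size(0)

print(f'top-1: {top1 / total:.4f}, top-5: {top5 / total:.4f}')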

Pretraining details

The models were all pretrained with code from this repository, which is my personal copy of the excellent MAE repository from Meta AI. In particular, I have used this SLURM batch script to train all models (this script contains all training configuration details). Pretraining logs for all models can be found in the logs/pretraining_logs folder.

Finetuning details

The models were again finetuned with code from the same repository as the pretraining code. In particular, I have used this SLURM batch script to finetune all models (this script contains all relevant finetuning configuration details). All finetuning logs for the permissive condition can be found in the logs/finetuning_logs folder.

One important point to note is that during finetuning, I did not use the standard heavy data augmentations and regularizers used for MAE finetuning (e.g. cutmix and mixup). I instead used very minimal data augmentations (just random resized crops and horizontal flips; see here). This is to make sure the finetuning data remain as "human-like" as possible. In my experience, it is possible to get a few percentage points better results (in absolute terms) with the more standard heavy augmentation and regularization pipeline used for MAEs. There is a separate branch of my MAE repository that implements this more standard finetuning pipeline. You can use this branch if you would like to finetune the pretrained models in a more standard way.
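
For concreteness, such a minimal augmentation pipeline can be written with torchvision as in the sketch below. This illustrates the idea (random resized crops plus horizontal flips only, no cutmix or mixup); the exact crop parameters and normalization used in the linked finetuning script may differ.

from torchvision import transforms

input_size = 476  # match the model's training resolution, e.g. 476 for vith14@476

# minimal "human-like" finetuning augmentations: no cutmix, mixup, or RandAugment
finetune_transform = transforms.Compose([
    transforms.RandomResizedCrop(input_size),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])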
