Update (Oct 3, 2024): All models in this repository have been updated. We retrained all the models with a larger batch size and for a larger number of epochs, which resulted in substantial improvements in downstream evaluations. Please see the updated arXiv preprint below for the new evaluation results. If you downloaded any of the models here before Oct 3, 2024, please redownload them from this repository for the much improved versions.
This is a stand-alone repository to facilitate the use of all video models I have trained so far. The models are all hosted on this Huggingface repository. For a detailed description of the models available in this repository and their capabilities, please see the following paper:
Orhan AE, Wang W, Wang AN, Ren M, Lake BM (2024) Self-supervised learning of video representations from a child's perspective. CogSci 2024 (oral).
- A reasonably recent version of PyTorch and torchvision. The code was tested with `pytorch==2.4.0` and `torchvision==0.19.0`.
- The `huggingface_hub` library to download the models from the Huggingface Hub. The code was tested with `huggingface-hub==0.24.5`.
- The model definitions rely on the `timm` library. The code was tested with `timm==1.0.8`.
- You do not need a GPU to load and use these models, although, of course, things will run much faster on a GPU.
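As a quick sanity check, you can verify your installed versions against the tested ones (a minimal sketch; newer releases will likely work as well, but the versions below are the ones the code was tested with):

```python
import huggingface_hub
import timm
import torch
import torchvision

# Versions the code was tested with; newer releases will likely work too.
print(torch.__version__)            # 2.4.0
print(torchvision.__version__)      # 0.19.0
print(huggingface_hub.__version__)  # 0.24.5
print(timm.__version__)             # 1.0.8
```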
Model names are specified in the format `x_y_z`, where `x` is the model type, `y` is the pretraining data the model is trained with, and `z` is the finetuning data the model is finetuned with (if any). All models have a ViT-H/14 backbone.

- `x` can be one of `mae`, `vit`
- `y` can be one of `say`, `s`, `kinetics`, `kinetics-200h`
- `z` can be one of `none`, `ssv2-10shot`, `ssv2-50shot`, `kinetics-10shot`, `kinetics-50shot`
Loading a pretrained model is then as easy as:
```python
from utils import load_model

model = load_model('vit_s_none')
```
This will download the corresponding pretrained checkpoint, store it in cache, build the right model architecture, and load the pretrained weights onto the model, all in one go.
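Once loaded, the model behaves like any other PyTorch module. Below is a minimal sketch of a forward pass; the `(batch, channels, time, height, width)` input layout and the 16-frame, 224x224 clip size are assumptions based on the upstream spatiotemporal MAE code, so double-check them against the model definitions here:

```python
import torch
from utils import load_model

model = load_model('vit_s_none')
model.eval()

# Dummy clip: 1 video, 3 channels, 16 frames, 224x224 pixels (assumed shape).
clip = torch.randn(1, 3, 16, 224, 224)

with torch.no_grad():
    out = model(clip)

print(out.shape)
```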
Model types (`x`):

- `mae` will instantiate a spatiotemporal MAE architecture (with an encoder and a decoder)
- `vit` will instantiate a standard spatiotemporal ViT-H/14 architecture.
If you'd like to continue training the pretrained models on new data with the spatiotemporal MAE objective, or if you'd like to analyze the pretrained MAE models (for example, their video interpolation capabilities), you should use the `mae` option. If you'd like to finetune the model on a standard downstream video/image recognition task, or something similar, you should choose the `vit` option instead.
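For example (a sketch, assuming the MAE forward interface mirrors the upstream spatiotemporal MAE repository, where the model takes a clip and a mask ratio and returns the reconstruction loss together with the predicted patches and the mask; confirm against the model definitions here):

```python
import torch
from utils import load_model

mae_model = load_model('mae_s_none')  # encoder + decoder
mae_model.eval()

clip = torch.randn(1, 3, 16, 224, 224)  # assumed clip shape, as above

with torch.no_grad():
    # Assumed signature: forward(clip, mask_ratio) -> (loss, prediction, mask)
    loss, pred, mask = mae_model(clip, mask_ratio=0.9)

print(loss.item())
```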
Pretraining data (`y`):

- `say`: the full SAYCam dataset
- `s`: child S only
- `kinetics`: the full Kinetics-700 dataset
- `kinetics-200h`: a 200-hour subset of Kinetics-700
The models were all pretrained with the spatiotemporal MAE objective using code from this repository. The SLURM batch scripts used for training all models can be found here.
Finetuning data (`z`):

- `none`: no finetuning (you will need to use this option if you choose the `mae` option for `x`)
- `ssv2-10shot`: the 10-shot SSV2 task
- `ssv2-50shot`: the 50-shot SSV2 task
- `kinetics-10shot`: the 10-shot Kinetics-700 task
- `kinetics-50shot`: the 50-shot Kinetics-700 task
The models were again all finetuned with code from this repository. The SLURM batch scripts used for finetuning all models can be found here.
You can see a full list of all available models by running:
```python
>>> import utils
>>> print(utils.get_available_models())
```
You will get an error if you try to load an unavailable model.
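If you want to guard against this programmatically, something like the following should work (assuming `get_available_models` returns an iterable of model name strings, which is what the printed list suggests):

```python
from utils import get_available_models, load_model

name = 'vit_say_none'
if name in get_available_models():
    model = load_model(name)
else:
    print(f'{name} is not an available model')
```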
In `visualize_completion.py`, I provide sample code to visualize model completions from pretrained spatiotemporal MAE models. An example usage would be as follows:
```bash
python -u visualize_completion.py \
    --model_name 'mae_s_none' \
    --mask_ratio 0.25 \
    --mask_type 'center' \
    --video_dir 'demo_videos' \
    --num_vids 16 \
    --device 'cuda'
```
This will randomly sample `num_vids` videos from `video_dir` and visualize the model completions together with the original sequence of frames and the masked frames. Currently, three types of masking strategies are supported: `random` (random spatiotemporal masking as in pretraining), `temporal` (masking out the final portion of the sequence), and `center` (masking out the middle part of the sequence, as described in the paper). Running the code with these masking strategies will produce images like the following, where the top row is the original sequence, the middle row is the masked sequence, and the bottom row is the model completion:
Further examples can be found in the `comps` folder.
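For intuition about what the three masking strategies above amount to at the token level, here is a minimal sketch. The 8x16x16 token grid is a hypothetical example (the actual grid depends on the model's temporal and spatial patch sizes), and this is an illustration rather than the repository's actual masking code:

```python
import torch

T, H, W = 8, 16, 16   # hypothetical token grid (temporal x spatial)
mask_ratio = 0.25     # fraction of tokens to mask (True = masked)

# 'random': mask a random subset of all spatiotemporal tokens, as in pretraining
num_tokens = T * H * W
random_mask = torch.zeros(num_tokens, dtype=torch.bool)
random_mask[torch.randperm(num_tokens)[:int(mask_ratio * num_tokens)]] = True
random_mask = random_mask.reshape(T, H, W)

# 'temporal': mask out the final portion of the sequence
n = int(mask_ratio * T)
temporal_mask = torch.zeros(T, H, W, dtype=torch.bool)
temporal_mask[-n:] = True

# 'center': mask out the middle part of the sequence
center_mask = torch.zeros(T, H, W, dtype=torch.bool)
start = (T - n) // 2
center_mask[start:start + n] = True
```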
In `visualize_attention.py`, I provide sample code to visualize the last-layer attention maps of the pretrained models. An example usage would be as follows:
```bash
python -u visualize_attention.py \
    --model_name 'vit_s_none' \
    --video_dir 'demo_videos' \
    --num_vids 16 \
    --device 'cuda'
```
Similar to the above, this will randomly sample `num_vids` videos from `video_dir` and visualize the last-layer attention maps (averaged over all attention heads) together with the original sequence of frames. Running the above will produce images like the following:
Further examples can be found in the `atts` folder.
It should be straightforward to hack the code to obtain the individual attention heads if you'd like to visualize them separately.
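One way to do this (a sketch, assuming the attention modules expose the `qkv`, `num_heads`, and `scale` attributes of a timm-style ViT block and that the blocks live in `model.blocks`; the exact attribute names may differ in this repository's model definitions) is to recompute the per-head attention weights from the last block's qkv projection with a forward hook:

```python
import torch
from utils import load_model

model = load_model('vit_s_none')
model.eval()

attn_store = {}

def grab_attention(module, inputs, output):
    # Recompute per-head attention weights from the qkv projection.
    x = inputs[0]
    B, N, C = x.shape
    qkv = (module.qkv(x)
           .reshape(B, N, 3, module.num_heads, C // module.num_heads)
           .permute(2, 0, 3, 1, 4))
    q, k = qkv[0], qkv[1]
    attn_store['last'] = ((q @ k.transpose(-2, -1)) * module.scale).softmax(dim=-1)

handle = model.blocks[-1].attn.register_forward_hook(grab_attention)
clip = torch.randn(1, 3, 16, 224, 224)  # assumed clip shape, as above
with torch.no_grad():
    model(clip)
handle.remove()

per_head = attn_store['last']  # (B, num_heads, N, N): one attention map per head
```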
I also include some minimal test code in `test_video_recognition.py` to check the validation accuracy of the finetuned models in downstream video recognition tasks (SSV2 or Kinetics-700). You can use it as follows:
```bash
python -u test_video_recognition.py \
    --model_name 'vit_s_ssv2-50shot' \
    --img_size 224 \
    --batch_size 256 \
    --val_dir VAL_DIR
```
where `VAL_DIR` is the path to the validation set of the appropriate downstream recognition task. Note that the task should be the same as the one the model was finetuned on.
The model definitions and parts of the code here are recycled from Facebook's excellent Spatiotemporal Masked Autoencoders repository.