This repository includes the official implementation and model weights for RC-MAE.
[arXiv] [OpenReview] [BibTeX]
Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders
🏛️🏫Youngwan Lee*, 🏫Jeff Willette*,️️ 🏛️Jonghee Kim, 🏫Juho Lee, 🏫Sung Ju Hwang
ETRI🏛️, KAIST🏫
*: equal contribution
International Conference on Learning Representations (ICLR) 2023
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers. A representative MIM model, the masked auto-encoder (MAE), randomly masks a subset of image patches and reconstructs the masked patches given the unmasked patches. Concurrently, many recent works in self-supervised learning utilize the student/teacher paradigm which provides the student with an additional target based on the output of a teacher composed of an exponential moving average (EMA) of previous students. Although common, relatively little is known about the dynamics of the interaction between the student and teacher. Through analysis on a simple linear model, we find that the teacher conditionally removes previous gradient directions based on feature similarities which effectively acts as a conditional momentum regularizer. From this analysis, we present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE. We find that RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training, which may provide a way to enhance the practicality of prohibitively expensive self-supervised learning of Vision Transformer models. Additionally, we show that RC-MAE achieves more robustness and better performance compared to MAE on downstream tasks such as ImageNet-1K classification, object detection, and instance segmentation.
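For intuition, here is a minimal PyTorch sketch of the idea described above: the student is trained with the usual MAE reconstruction loss plus a consistency loss against the reconstruction produced by an EMA teacher. This is an illustrative sketch, not the repository's implementation; the forward signature `model(images, mask)`, the `patchify` helper, and the decay value are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def patchify(images, patch_size=16):
    # Rearrange (B, C, H, W) images into (B, num_patches, patch_size**2 * C) pixel targets.
    b, c, h, w = images.shape
    p = patch_size
    x = images.reshape(b, c, h // p, p, w // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // p) * (w // p), p * p * c)


@torch.no_grad()
def ema_update(teacher, student, decay=0.996):
    # Teacher weights are an exponential moving average of student weights;
    # the teacher is never updated by back-propagation.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p.detach(), alpha=1.0 - decay)


def rc_mae_loss(student, teacher, images, mask):
    # Both networks receive the same randomly masked input. The student gets
    # two targets for the masked patches: the original pixels (reconstruction)
    # and the EMA teacher's prediction (consistency).
    pred_s = student(images, mask)          # assumed output: (B, num_patches, p*p*C)
    with torch.no_grad():
        pred_t = teacher(images, mask)
    target = patchify(images)
    recon = F.mse_loss(pred_s[mask], target[mask])     # MAE reconstruction term
    consist = F.mse_loss(pred_s[mask], pred_t[mask])   # teacher consistency term
    return recon + consist
```

In a training loop, the teacher would start as a copy of the student and `ema_update(teacher, student)` would be called after each optimizer step.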
We train all models on one node equipped with 8 GPUs.
ImageNet-1K fine-tuning top-1 accuracy (%). PT: pre-trained checkpoint, FN: fine-tuned checkpoint.

| Model | MAE (repro.) | RC-MAE (ours) | Checkpoint |
|---|---|---|---|
| ViT-S | 81.8 | 82.0 | PT / FN |
| ViT-B | 83.4 | 83.6 | PT / FN |
| ViT-L | 85.5 | 86.1 | PT / FN |
We train all models on one node equipped with 8 GPUs.
Our detection code is based on mimdet, which reproduces Benchmarking Detection Transfer Learning with Vision Transformers.
COCO object detection and instance segmentation.

| Method | Backbone | box AP | mask AP | Checkpoint |
|---|---|---|---|---|
| MAE | ViT-B | 50.3 | 44.9 | - |
| RC-MAE | ViT-B | 51.0 | 45.4 | link |
| MAE | ViT-L | 53.3 | 47.2 | - |
| RC-MAE | ViT-L | 53.8 | 47.7 | link |
We used `pytorch==1.7.0` and `timm==0.3.2`.
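As a purely illustrative check (not part of the repository), you can confirm that your environment matches these versions:

```python
# Illustrative environment check for the versions listed above
# (pytorch==1.7.0, timm==0.3.2).
import timm
import torch

print("torch:", torch.__version__)
print("timm:", timm.__version__)
assert torch.__version__.startswith("1.7"), "this repo was run with pytorch 1.7.0"
assert timm.__version__ == "0.3.2", "this repo was run with timm 0.3.2"
```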
Our code is implemented on top of the official MAE code.
We provide scripts for pretraining, finetuning, and linear probing.
To pre-train ViT-Base (recommended default) in a multi-GPU setting, run the following on 1 node with 8 GPUs:
bash ./scripts/pretrain_rc_mae_vit_base_1600ep.sh ${DATA} ${OUTPUT_DIR}
- `${DATA}`: ImageNet data path
- `${OUTPUT_DIR}`: output folder name
To fine-tune with a pre-trained checkpoint, run:
bash ./scripts/finetune_rc_mae_base.sh ${DATA} ${CKPT_FILE_PATH} ${OUTPUT_DIR}
- `${DATA}`: ImageNet data path
- `${CKPT_FILE_PATH}`: pre-trained checkpoint file path (see the checkpoint-loading sketch below)
- `${OUTPUT_DIR}`: output folder name
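For reference, here is a minimal sketch of loading a pre-trained checkpoint into a ViT encoder outside the provided scripts. It assumes, as in the official MAE code this repository builds on, that the encoder weights are stored under a `model` key; the file name and timm model name are placeholders.

```python
# Illustrative checkpoint loading for fine-tuning; file/model names are placeholders.
import timm
import torch

ckpt = torch.load("rc_mae_vit_base_pretrained.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)   # assumed: encoder weights under the 'model' key

model = timm.create_model("vit_base_patch16_224", num_classes=1000)
# strict=False: decoder/mask-token keys from pre-training are skipped and the
# freshly initialized classification head is kept.
msg = model.load_state_dict(state_dict, strict=False)
print("missing:", msg.missing_keys)
print("unexpected:", msg.unexpected_keys)
```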
This repository is built using the MAE, timm, and DeiT repositories.
This work was partly supported by Institute of Information & Communications Technology Planning & Evaluation(IITP) grant funded by the Korea government(MSIT) (No. 2014-3-00123, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis, No. 2020-0-00004, Development of Previsional Intelligence based on Long-term Visual Memory Network), (No.2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities), (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training), and (No.2019-0-00075, Artificial Intelligence Graduate School Program(KAIST)).
This project is released under the CC-BY-NC 4.0 license. Please see LICENSE for details.
@inproceedings{
lee2023rcmae,
title={Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders},
author={Youngwan Lee and Jeffrey Ryan Willette and Jonghee Kim and Juho Lee and Sung Ju Hwang},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=7sn6Vxp92xV}
}