This repo lists recent advances in vision-language models (VLMs), mainly contributed by Weihan Wang and Ji Qi.

Model | Vision Enc. | Textual Enc. | Dec. | Multimodal Fusion | Pretraining Objectives | Pretraining Dataset | Year (Venue) |
---|---|---|---|---|---|---|---|
ViLBERT | OD->Xformer | Xformer | / | Co-attn | MLM+ITM+MIM | CC3M | 2019 (NIPS) |
LXMERT | OD+Xformer | Xformer | / | Co-attn | MLM+ITM+MIM+VQA | COCO+VG+VQA | 2019 (Arxiv) |
VisualBERT | OD | Emb. | / | Merged-attn | MLM+ITM | COCO | 2019 (Arxiv) |
UNITER | OD | Emb. | / | Merged-attn | MLM+ITM+MIM+WRA | COCO+VG+CC3M+SBU | 2020 (ECCV) |
VL-BERT | OD | Emb. | / | Merged-attn | MLM+ITM | CC3M | 2020 (ICLR) |
OSCAR | OD | Emb. | / | Merged-attn | MLM+ITM | 4.1M | 2020 (ECCV) |
PixelBERT | CNN | Xformer | / | Merged-attn | MLM+ITM | COCO+VG | 2020 (Arxiv) |
VILLA | OD | Emb. | / | Merged-attn | Adversarial Training+MLM+MIM+ITM | COCO+VG+CC3M+SBU | 2020 (NIPS) |
ViLBERT-12in1 | OD->Xformer | Xformer | / | Co-attn | Multi Tasks | Multi Datasets | 2020 (CVPR) |
CLIP | CNN/Xformer | Xformer | / | / | ITC | 400M | 2021 (ICML) |
ALIGN | CNN | Xformer | / | / | ITC | 1800M | 2021 (ICML) |
VinVL | OD | Emb. | / | Merged-attn | MLM+ITM | COCO+VG+OI+OBJ365 | 2021 (CVPR) |
MDETR | CNN | Xformer | √ | Merged-attn | OD+Token Prediction+Contrastive Alignment | COCO+VG+Flickr | 2021 (ICCV) |
VL-T5 | OD | Emb. | √ | Merged-attn | MLM+ITM+VQA+Grounding+Captioning | COCO+VG | 2021 (ICML) |
CLIP-VIL | CNN | Emb. | / | Merged-attn | MLM+ITM+VQA | COCO+VG+VQA | 2021 (Arxiv) |
SOHO | CNN | Emb. | / | Merged-attn | MLM+ITM+MIM | COCO+VG | 2021 (CVPR) |
VILT | Patch Emb. | Emb. | / | Merged-attn | MLM+ITM | COCO+VG+CC3M+SBU | 2021 (ICCV) |
ALBEF | Xformer | Xformer | / | Co-attn | MLM+ITM+ITC | COCO+VG+CC12M+SBU | 2021 (NIPS) |
VLMO | Xformer | Xformer | / | Multiway-attn | MLM+ITM+ITC | 4M/1000M | 2021 (Arxiv) |
Florence | Xformer | Xformer | / | / | ITC | 900M | 2021 (Arxiv) |
OFA | CNN | Emb. | √ | Co-attn | Multi Tasks | 20M | 2022 (ICML) |
METER | Xformer | Xformer | / | Co-attn | MLM+ITM | COCO+VG+CC3M+SBU | 2022 (CVPR) |
GLIP | Xformer | Xformer | / | Co-attn | OD+Token Prediction+Contrastive Alignment | FourODs+GoldG+Cap24M | 2022 (CVPR) |
GLIP-v2 | Xformer | Xformer | / | Co-attn | MLM+OD+Token Prediction+Contrastive Alignment | FourODs+GoldG+Cap24M | 2022 (NIPS) |
SimVLM | CNN | Emb. | / | Merged-attn | PrefixLM | 1800M | 2022 (ICLR) |
Flamingo | Xformer | Xformer | √ | Co-attn | ITC+Captioning+.. | 1.8B+LTIP+VTP | 2022 (Arxiv) |
PALI | Xformer | Xformer | √ | Co-attn | Multi Tasks | 10b image+12b text+29b image-ocr | 2022 (Arxiv) |
FIBER | Xformer | Xformer | / | Co-attn | MLM+ITM+ITC | COCO+VG+CC3M+SBU | 2022 (Arxiv) |
COCA | Xformer | Xformer | √ | Co-attn | ITC+Captioning | JFT-3B+Align | 2022 (Arxiv) |
BEIT-3 | Xformer | Xformer | / | Co-attn | MLM | COCO+VG+CC3M+CC12M+SBU | 2022 (Arxiv) |
- I : image inputs
- T : text inputs
- OD : object detector
- Xformer : transformer
- Emb. : embedding
- MLM : masked language modeling
- MIM : masked image modeling
- ITM : image-text matching
- WRA : word-region alignment
- ITC : image-text contrastive learning
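
As an illustration of the ITC objective listed above, here is a minimal sketch of a symmetric, InfoNCE-style contrastive loss of the kind popularized by CLIP. It is not taken from any of the listed papers' code: the function names (`l2_normalize`, `cross_entropy_row`, `itc_loss`), the fixed `temperature=0.07`, and the toy 2-D embeddings are assumptions chosen for illustration. Real implementations operate on batched tensors and often learn the temperature.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-sum-exp)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of image_embs and text_embs is the positive.

    temperature is an illustrative assumption, not a value from the table above.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # image-to-text: each image should pick out its own caption among all texts
    i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # text-to-image: each caption should pick out its own image among all images
    t2i = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (i2t + t2i) / 2


# Toy check: matched pairs give a much lower loss than mismatched pairs.
aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True
```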
1. CARETS: A Consistency And Robustness Evaluative Test Suite for VQA (ACL 2022) [paper]
2. VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (ACL 2022) [paper]