
MAGVLT: Masked Generative Vision-and-Language Transformer


The official PyTorch implementation of MAGVLT: Masked Generative Vision-and-Language Transformer (CVPR 2023).

MAGVLT is a unified non-autoregressive generative vision-and-language (VL) model. It is trained via 1) three multimodal masked token prediction tasks, along with two auxiliary sub-tasks: 2) step-unrolled masked prediction and 3) MixSel.
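For intuition, the core objective can be sketched as follows. This is a minimal illustration assuming a MaskGIT-style cosine masking schedule; the names (model, image_tokens, text_tokens, mask_id) are hypothetical and not the repository's actual API.

import math
import torch
import torch.nn.functional as F

def masked_token_prediction_loss(model, image_tokens, text_tokens, mask_id):
    # Concatenate discrete image tokens (from VQGAN) and text tokens into one
    # sequence; a single bidirectional transformer predicts the masked slots.
    tokens = torch.cat([image_tokens, text_tokens], dim=1)  # (B, L)
    B, L = tokens.shape

    # Sample a per-example mask ratio from a cosine schedule (illustrative).
    ratio = torch.cos(torch.rand(B, 1) * math.pi / 2)
    mask = torch.rand(B, L) < ratio  # True = position is masked

    inputs = tokens.masked_fill(mask, mask_id)
    logits = model(inputs)  # (B, L, vocab_size), assumed signature

    # Cross-entropy is computed only on the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])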

Requirements

We have tested our code in the following environment:

PyTorch 1.10.0
Python 3.7.11
Ubuntu 18.04

Run the following command to install the remaining dependencies:

pip install -r requirements.txt

Coverage of the Released Code

  • Implementation of MAGVLT
  • Pretrained checkpoints of MAGVLT-base and MAGVLT-large
  • Sampling pipelines of MAGVLT:
    • Generate image from text
    • Generate text from image
    • Generate image from text and image (inpainting)
    • Generate text from text and image (infilling)
    • Generate text and image (unconditional generation)
  • Evaluation pipelines of MAGVLT on downstream tasks
  • Training pipeline with data preparation example

Pretrained Checkpoints

MAGVLT uses VQGAN (vqgan_imagenet_f16_16384) as the image encoder; the checkpoint can be downloaded from the taming-transformers repository. A short sketch of the tokenization step follows the table below.

Model         #Parameters  CIDEr (↑, COCO)  CIDEr (↑, NoCaps)  FID (↓, COCO)
MAGVLT-base   371M         60.4             46.3               12.08
MAGVLT-large  840M         68.1             55.8               10.14
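
As a hedged illustration of the image-tokenization step, the sketch below loads the VQGAN with the taming-transformers VQModel interface and encodes a 256x256 image into a 16x16 grid of codebook indices. The config and checkpoint paths are placeholders, not files shipped with this repository.

import torch
from omegaconf import OmegaConf
from taming.models.vqgan import VQModel

# Placeholder paths -- point these at the downloaded VQGAN files.
config = OmegaConf.load("vqgan_imagenet_f16_16384.yaml")
vqgan = VQModel(**config.model.params)
state = torch.load("vqgan_imagenet_f16_16384.ckpt", map_location="cpu")["state_dict"]
vqgan.load_state_dict(state, strict=False)  # skip loss weights not needed for inference
vqgan.eval()

# With downsampling factor f=16, a 256x256 image yields 16x16 = 256 tokens.
image = torch.randn(1, 3, 256, 256)  # dummy image scaled to [-1, 1]
with torch.no_grad():
    quant, _, (_, _, indices) = vqgan.encode(image)
print(indices.shape)  # flattened codebook indices, 256 per image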

Sampling

We provide the following sampling scripts.

# Text-to-image generation
python sampling_t2i.py  --prompt=[YOUR_PROMPT] \
                        --config_path=configs/magvlt-it2it-base-sampling.yaml \
                        --model_path=[MAGVLT_MODEL_PATH] \
                        --stage1_model_path=[VQGAN_MODEL_PATH]

# Image-to-text generation (captioning)
python sampling_i2t.py  --source_img_path=[YOUR_IMAGE_PATH] \
                        --config_path=configs/magvlt-it2it-base-sampling.yaml \
                        --model_path=[MAGVLT_MODEL_PATH] \
                        --stage1_model_path=[VQGAN_MODEL_PATH]

# Image-and-text-to-image generation (inpainting)
python sampling_it2i.py --prompt=[YOUR_PROMPT] \
                        --source_img_path=[YOUR_IMAGE_PATH] \
                        --config_path=configs/magvlt-it2it-base-sampling.yaml \
                        --model_path=[MAGVLT_MODEL_PATH] \
                        --stage1_model_path=[VQGAN_MODEL_PATH]
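
All three scripts share the same non-autoregressive decoding idea: start from a fully masked target sequence and iteratively commit the most confident predictions. The sketch below is a generic MaskGIT-style decoder written for illustration under assumed names (model, mask_id); it is not the exact sampler implemented in this repository.

import math
import torch

@torch.no_grad()
def iterative_decode(model, length, mask_id, steps=8, device="cpu"):
    # Start from an all-[MASK] canvas and progressively commit tokens.
    tokens = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for t in range(steps):
        logits = model(tokens)  # (1, L, vocab_size), assumed signature
        probs = logits.softmax(dim=-1)
        confidence, prediction = probs.max(dim=-1)  # greedy choice per position

        # Keep already-committed tokens fixed; consider only masked slots.
        still_masked = tokens == mask_id
        confidence = confidence.masked_fill(~still_masked, -1.0)

        # Cosine schedule: fraction of positions left masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * (t + 1) / steps))
        num_to_commit = still_masked.sum().item() - keep_masked
        if num_to_commit <= 0:
            continue
        idx = confidence.topk(num_to_commit, dim=-1).indices
        tokens[0, idx[0]] = prediction[0, idx[0]]
    return tokens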

Citation

@InProceedings{Kim_2023_CVPR,
    author    = {Kim, Sungwoong and Jo, Daejin and Lee, Donghoon and Kim, Jongmin},
    title     = {MAGVLT: Masked Generative Vision-and-Language Transformer},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {23338-23348}
}

Contact

Donghoon Lee, dhlee@kakaobrain.com
Jongmin Kim, jmkim@kakaobrain.com

License

This project is released under the MIT license.
