https://arxiv.org/abs/2302.00402
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou)
So this is a unified model covering vision, text, image-text, and video-text.
#multimodal