230209 Re-ViLM.md
https://arxiv.org/abs/2302.04858

Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning (Zhuolin Yang, Wei Ping, Zihan Liu, Vijay Korthikanti, Weili Nie, De-An Huang, Linxi Fan, Zhiding Yu, Shiyi Lan, Bo Li, Ming-Yu Liu, Yuke Zhu, Mohammad Shoeybi, Bryan Catanzaro, Chaowei Xiao, Anima Anandkumar)

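Retrieval augmentation doesn't have to be limited to text. The idea here is that a vision-language model uses the input image to retrieve image-caption pairs, then uses the retrieved captions to generate a caption for the input.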
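To make the retrieval step concrete, here is a minimal sketch, not the paper's implementation. It assumes images are already encoded as embedding vectors and that retrieval is a cosine-similarity nearest-neighbor lookup; the toy database, the 3-dimensional embeddings, and the names `retrieve_captions` and `build_context` are all hypothetical. How the retrieved captions are actually fed into the language model is left out.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_captions(query_emb, database, k=2):
    """Return the captions of the k database images most similar to the query.

    `database` is a list of (image_embedding, caption) pairs (hypothetical format).
    """
    ranked = sorted(database, key=lambda pair: cosine(query_emb, pair[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]

def build_context(captions):
    """Join retrieved captions into a text context for a caption generator (assumed format)."""
    return "\n".join(f"Similar image caption: {c}" for c in captions)

# Toy image-caption database; real embeddings would come from an image encoder.
DATABASE = [
    ([1.0, 0.0, 0.0], "a dog running on the beach"),
    ([0.9, 0.1, 0.0], "a puppy playing in the sand"),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table"),
]

if __name__ == "__main__":
    query = [1.0, 0.05, 0.0]  # embedding of the input image (made up)
    print(build_context(retrieve_captions(query, DATABASE, k=2)))
```

The captions of the nearest images act as extra context for generation, which is the text-retrieval recipe with the query swapped from text to an image.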

#vision-language