200401 Pixel-BERT.md

https://arxiv.org/abs/2004.00849

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers (Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu)

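Another instance of the image-text pretraining that has been showing up a lot lately. Instead of extracting features from bounding boxes, the whole image is fed into a ResNet, and the resulting feature map is flattened and used as the input embeddings.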
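
To make the "flatten the feature map into embeddings" step concrete, here is a minimal sketch in plain Python. It assumes a CNN output of shape (C, H, W) and turns it into H*W visual tokens of dimension C. The helper name `flatten_feature_map` and the toy values are hypothetical and not from the paper; the ResNet itself and any further processing the paper applies are left out.

```python
def flatten_feature_map(feature_map):
    """Turn a (C, H, W) feature map into a list of H*W vectors of length C.

    Each spatial position (y, x) becomes one "visual token" whose entries
    are the channel values at that position, in row-major order.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# Toy stand-in for a CNN output: 2 channels over a 2x3 spatial grid.
# (A real backbone output would have far more channels and positions.)
toy_feature_map = [
    [[1, 2, 3], [4, 5, 6]],
    [[10, 20, 30], [40, 50, 60]],
]

visual_tokens = flatten_feature_map(toy_feature_map)

# Hypothetical text embeddings with the same dimension as the visual tokens.
text_tokens = [[0, 0], [0, 1]]

# One joint sequence that a Transformer could take as input.
joint_sequence = text_tokens + visual_tokens
```

With tensors, the same step is usually a reshape from (C, H, W) to (H*W, C). The point is only that no region proposals or bounding boxes are involved: every spatial position of the feature map becomes one token.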

#multimodal