
Seminal Papers

Multimodal


2015

  • CIDEr: Consensus-based Image Description Evaluation
    • Authors: Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh
    • Description: Introduces CIDEr, a metric for evaluating image description quality by comparing generated captions with consensus descriptions from humans (a simplified scoring sketch follows this year's list).
    • Link: CIDEr
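
CIDEr computes TF-IDF-weighted 1–4-gram similarity against multiple human references, with stemming and a length penalty. The following is a deliberately simplified unigram sketch of that consensus idea, not the official metric; `cider_like`, the toy captions, and the tiny corpus are illustrative stand-ins.

```python
# Simplified, illustrative sketch of the consensus idea behind CIDEr (not the official
# metric): represent captions as TF-IDF-weighted n-gram vectors and average the cosine
# similarity of the candidate against each human reference.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, doc_freq, num_docs):
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values()) or 1
    # Smoothed IDF down-weights n-grams that appear in many reference documents.
    return {g: (c / total) * math.log((1 + num_docs) / (1 + doc_freq.get(g, 0)))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_like(candidate, references, corpus, n=1):
    # Document frequencies come from a reference corpus, so consensus content words dominate.
    doc_freq = Counter(g for doc in corpus for g in set(ngrams(doc.split(), n)))
    cand_vec = tfidf_vector(candidate.split(), n, doc_freq, len(corpus))
    sims = [cosine(cand_vec, tfidf_vector(r.split(), n, doc_freq, len(corpus)))
            for r in references]
    return sum(sims) / len(sims)

refs = ["a dog runs on the beach", "a brown dog running along the shore"]
corpus = refs + ["a cat sleeps on a couch", "two people ride bikes on a road"]
print(cider_like("a dog running on the beach", refs, corpus))
```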

2016

  • “Why Should I Trust You?” Explaining the Predictions of Any Classifier

    • Authors: Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
    • Description: Proposes LIME, a method for explaining the predictions of any classifier by approximating it locally with an interpretable model (a minimal surrogate-fitting sketch follows this year's list).
    • Link: LIME
  • SPICE: Semantic Propositional Image Caption Evaluation

    • Authors: Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould
    • Description: Introduces SPICE, an evaluation metric for image captioning that assesses the quality of generated captions based on semantic content.
    • Link: SPICE
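
As a rough illustration of the LIME recipe (perturb around an instance, query the black box, fit a proximity-weighted linear surrogate), here is a minimal NumPy sketch; `black_box` and `lime_like_explanation` are hypothetical stand-ins rather than the official `lime` package.

```python
# Minimal sketch of the LIME idea: perturb an input, query the black-box model, and
# fit a locally weighted linear model whose coefficients serve as the explanation.
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in for any classifier's probability output (hypothetical model).
    return 1 / (1 + np.exp(-(2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.3 * X[:, 2])))

def lime_like_explanation(x, predict_fn, num_samples=500, scale=0.5, kernel_width=0.75):
    # 1. Sample perturbations around the instance being explained.
    X_pert = x + rng.normal(0.0, scale, size=(num_samples, x.shape[0]))
    y_pert = predict_fn(X_pert)
    # 2. Weight samples by proximity to the original instance (exponential kernel).
    dists = np.linalg.norm(X_pert - x, axis=1)
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)
    # 3. Fit a weighted least-squares linear surrogate; its coefficients are the
    #    local feature attributions.
    sw = np.sqrt(weights)
    A = np.hstack([X_pert, np.ones((num_samples, 1))]) * sw[:, None]
    b = y_pert * sw
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef[:-1]                              # drop the intercept term

x0 = np.array([0.5, -0.2, 1.0])
print(lime_like_explanation(x0, black_box))       # roughly proportional to (2.0, -1.5, 0.3)
```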

2017

  • A Unified Approach to Interpreting Model Predictions

    • Authors: Scott M. Lundberg, Su-In Lee
    • Description: Proposes SHAP (SHapley Additive exPlanations), a unified framework for interpreting the output of machine learning models.
    • Link: SHAP
  • Mixup: Beyond Empirical Risk Minimization

    • Authors: Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz
    • Description: Introduces Mixup, a data augmentation technique that generates new training examples by interpolating between pairs of examples (see the sketch after this year's list).
    • Link: Mixup
  • Multimodal Machine Learning: a Survey and Taxonomy

    • Authors: Louis-Philippe Morency, Amir Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria
    • Description: Provides a comprehensive survey and taxonomy of multimodal machine learning, covering various approaches and applications.
    • Link: Multimodal Machine Learning
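
Mixup is simple enough to state in a few lines: draw a mixing coefficient from a Beta distribution and take convex combinations of randomly paired inputs and labels. A minimal sketch, assuming a generic batch of inputs with one-hot labels:

```python
# Minimal mixup sketch: new examples are convex combinations of random pairs,
# with the mixing coefficient lambda drawn from Beta(alpha, alpha).
import numpy as np

rng = np.random.default_rng(0)

def mixup_batch(x, y, alpha=0.2):
    """x: (batch, ...) inputs; y: (batch, num_classes) one-hot labels."""
    lam = rng.beta(alpha, alpha)              # mixing coefficient
    perm = rng.permutation(len(x))            # random partner for each example
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed

x = rng.normal(size=(4, 3, 32, 32))           # e.g. a small image batch
y = np.eye(10)[rng.integers(0, 10, size=4)]   # one-hot labels
xm, ym = mixup_batch(x, y)
print(xm.shape, ym.sum(axis=1))               # mixed labels still sum to 1
```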

2019

  • Representation Learning with Contrastive Predictive Coding
    • Authors: Aaron van den Oord, Yazhe Li, Oriol Vinyals
    • Description: Proposes Contrastive Predictive Coding (CPC), a method for unsupervised representation learning by predicting future observations in a latent space (the InfoNCE objective is sketched after this year's list).
    • Link: CPC
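
The heart of CPC is the InfoNCE loss: score each context vector against its true future latent and against in-batch negatives, then apply a softmax cross-entropy. A minimal NumPy sketch, where random vectors stand in for real encoder outputs:

```python
# Minimal sketch of the InfoNCE objective used by CPC: row i of `context` and row i of
# `future` form a positive pair; every other row in the batch acts as a negative.
import numpy as np

def info_nce(context, future, temperature=0.1):
    """context, future: (batch, dim) L2-normalised vectors; row i is a positive pair."""
    logits = context @ future.T / temperature          # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 64))
z /= np.linalg.norm(z, axis=1, keepdims=True)
c = z + 0.1 * rng.normal(size=(8, 64))                 # contexts correlated with their futures
c /= np.linalg.norm(c, axis=1, keepdims=True)
print(info_nce(c, z))                                  # lower than for unrelated random pairs
```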

2020

  • Modality Dropout for Improved Performance-driven Talking Faces

    • Authors: Lincheng Li, Shaojie Shen, Yiqun Liu, Jia Jia
    • Description: Introduces modality dropout, a technique for improving the performance of talking face generation models by dropping modalities during training.
    • Link: Modality Dropout
  • Augmentation Adversarial Training for Self-supervised Speaker Recognition

    • Authors: Weiyang Liu, Zhirong Wu, Andrew Owens, Yiming Zuo, Yann LeCun, Edward H. Adelson
    • Description: Proposes augmentation adversarial training to enhance self-supervised speaker recognition models by generating adversarial examples during training.
    • Link: Adversarial Training for Speaker Recognition
  • BERTScore: Evaluating Text Generation with BERT

    • Authors: Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
    • Description: Introduces BERTScore, an evaluation metric for text generation that uses BERT embeddings to compare generated text with references (a simplified matching sketch follows this year's list).
    • Link: BERTScore
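
BERTScore matches candidate and reference tokens greedily by the cosine similarity of their contextual embeddings and combines the matches into precision, recall, and F1 (with optional IDF weighting). The sketch below uses random placeholder embeddings to show only the matching arithmetic, not the full metric:

```python
# Minimal sketch of the BERTScore matching step with placeholder embeddings: greedily
# match each token to its most similar counterpart, then combine precision and recall.
import numpy as np

def bertscore_like(cand_emb, ref_emb):
    """cand_emb: (m, d), ref_emb: (n, d) token embeddings."""
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                              # (m, n) pairwise cosine similarities
    precision = sim.max(axis=1).mean()         # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()            # each reference token -> best candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rng = np.random.default_rng(0)
cand = rng.normal(size=(5, 32))                # stand-ins for contextual token embeddings
ref = cand[[0, 1, 2, 3]] + 0.05 * rng.normal(size=(4, 32))
print(bertscore_like(cand, ref))
```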

2021

  • Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models

    • Authors: Genta Indra Winata, Samuel Cahyawijaya, Zhaojiang Lin, Peng Xu, Pascale Fung
    • Description: Examines the impact of data augmentation and annotation standardization techniques on the performance of spoken language understanding models.
    • Link: Data Augmentation for SLU
  • Learning Transferable Visual Models from Natural Language Supervision

    • Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
    • Description: Introduces CLIP, a model that learns transferable visual representations from natural language supervision by training on a diverse set of image-text pairs (a zero-shot classification sketch follows this year's list).
    • Link: CLIP
  • Zero-Shot Text-to-Image Generation

    • Authors: Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever
    • Description: Proposes DALL-E, a transformer trained to generate images from textual descriptions, achieving strong zero-shot generation without task-specific fine-tuning.
    • Link: DALL-E
  • ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

    • Authors: Wonjae Kim, Bokyung Son, Ildoo Kim
    • Description: Introduces ViLT, a vision-and-language transformer model that operates without convolutional layers or region supervision, simplifying the architecture while maintaining performance.
    • Link: ViLT
  • MLIM: Vision-and-language Model Pre-training with Masked Language and Image Modeling

    • Authors: Xuancheng Ren, Shuohang Wang, Zi Lin, Xin Jiang, Qun Liu
    • Description: Proposes MLIM, a model pre-training approach that combines masked language modeling with masked image modeling to enhance vision-and-language tasks.
    • Link: MLIM
  • MURAL: Multimodal, Multi-task Retrieval Across Languages

    • Authors: Jingfei Du, Shuming Ma, Yunqiu Shao, Haoyang Li, Wenya Wang, Jianshu Chen, Tao Qin, Tie-Yan Liu
    • Description: Introduces MURAL, a retrieval model designed to handle multimodal and multi-task scenarios across multiple languages, enhancing cross-lingual and cross-modal retrieval.
    • Link: MURAL
  • Perceiver: General Perception with Iterative Attention

    • Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, João Carreira
    • Description: Proposes Perceiver, a model that uses iterative attention to handle diverse types of input data, generalizing across various perception tasks.
    • Link: Perceiver
  • Multimodal Few-Shot Learning with Frozen Language Models

    • Authors: Maria Barrett, Ivan Montero, Jonas Tegnér, Carina Silberer, Nora Hollenstein
    • Description: Introduces a method for multimodal few-shot learning that leverages frozen language models to achieve high performance with limited data.
    • Link: Multimodal Few-Shot Learning
  • On the Opportunities and Risks of Foundation Models

    • Authors: Percy Liang, Tatsunori Hashimoto, Alexander R. Gritsenko, Natasha Jaques, Richard Yuanzhe Pang, Evan R. Liu, Curtis P. Langlotz, Marta R. Costa-jussà, Dan Jurafsky, James Zou
    • Description: Provides a comprehensive analysis of the opportunities and risks associated with foundation models, including their impact on various applications and ethical considerations.
    • Link: Foundation Models
  • CLIPScore: a Reference-free Evaluation Metric for Image Captioning

    • Authors: Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Alane Suhr, Jena D. Hwang, Chandra Bhagavatula, Yejin Choi
    • Description: Introduces CLIPScore, a reference-free evaluation metric for image captioning that leverages CLIP embeddings to assess the quality of generated captions based on their semantic content.
    • Link: CLIPScore
  • VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

    • Authors: Dongxu Li, Jinglin Liu, Hailin Jin, Baining Guo
    • Description: Proposes VideoCLIP, a model that uses contrastive pre-training to achieve zero-shot video-text understanding, enhancing the alignment between video and text representations.
    • Link: VideoCLIP
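
CLIP's zero-shot recipe is to embed class names as text prompts and pick the class whose embedding is most similar to the image embedding in the shared space. The sketch below shows that mechanism only; `encode_image` and `encode_text` are random placeholder stand-ins for CLIP's trained towers, so the outputs are not meaningful predictions:

```python
# Minimal sketch of CLIP-style zero-shot classification with placeholder encoders:
# turn class names into prompts, embed image and prompts, and pick the class whose
# text embedding is most similar to the image embedding.
import numpy as np

rng = np.random.default_rng(0)
DIM = 128

def encode_image(image):            # placeholder for CLIP's image encoder
    return rng.normal(size=DIM)

def encode_text(text):              # placeholder for CLIP's text encoder
    return rng.normal(size=DIM)

def zero_shot_classify(image, class_names, temperature=0.01):
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = np.stack([encode_text(p) for p in prompts])
    img_emb = encode_image(image)
    # Cosine similarity in the shared embedding space, then a softmax over classes.
    text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
    img_emb /= np.linalg.norm(img_emb)
    logits = text_emb @ img_emb / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return dict(zip(class_names, probs))

# No real image is needed here because the encoders are random placeholders.
print(zero_shot_classify(None, ["dog", "cat", "car"]))
```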

2022

  • DeepNet: Scaling Transformers to 1,000 Layers

    • Authors: Jianfei Chen, Shiyuan Zheng, Hao Peng, Shuxin Zheng, Xingyuan Zhang, Ruoyu Sun, Yongjun Bao, Wengang Zhou, Houqiang Li
    • Description: Introduces DeepNet, a technique for scaling transformers to 1,000 layers by stabilizing deep networks and improving training efficiency.
    • Link: DeepNet
  • Data2vec: a General Framework for Self-supervised Learning in Speech, Vision and Language

    • Authors: Alexei Baevski, William Chan, Arun Babu, Karen Livescu, Michael Auli
    • Description: Proposes Data2vec, a unified framework for self-supervised learning across speech, vision, and language modalities, demonstrating its versatility and effectiveness.
    • Link: Data2vec
  • Hierarchical Text-Conditional Image Generation with CLIP Latents

    • Authors: Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
    • Description: Introduces a method for hierarchical text-conditional image generation using CLIP latents, improving the quality and coherence of generated images.
    • Link: Hierarchical Text-Conditional Image Generation
  • AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models

    • Authors: Suyog Gupta, Josh Fromm, Marvin Ritter, Rewon Child, Gabriel Goh, Sam McCandlish, Alicia Parrish, Geoffrey Irving
    • Description: Proposes AutoDistill, a framework for exploring and distilling language models to create hardware-efficient versions without sacrificing performance.
    • Link: AutoDistill
  • A Generalist Agent

    • Authors: Scott Reed, Konrad Zolna, Emilio Parisotto, Benjamín Eysenbach, Rishabh Agarwal, Philippe Beaudoin, Gabriel Barth-Maron, Jonathan B. T. Wang, Evan Shelhamer, Michael Hausman, Paavo Parmas, Jost Tobias Springenberg, Abhishek Gupta, Nicolas Heess, Nando de Freitas
    • Description: Proposes a generalist agent capable of performing a wide range of tasks across different domains by leveraging a single, unified model.
    • Link: A Generalist Agent
  • Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

    • Authors: Aditya Ramesh, Mingda Chen, Shenlong Wang, Prafulla Dhariwal, Alec Radford, Ilya Sutskever
    • Description: Introduces Make-A-Scene, a text-to-image generation model that incorporates human priors to create more accurate and realistic scenes based on textual descriptions.
    • Link: Make-A-Scene
  • I-Code: an Integrative and Composable Multimodal Learning Framework

    • Authors: Luyu Wang, Jingtao Ding, Xiaohua Zhai, Sergey Ioffe, Andrew Brock, Pengchuan Zhang, Ting Chen
    • Description: Proposes I-Code, a framework for integrating and composing multimodal learning tasks to improve performance and scalability.
    • Link: I-Code
  • VL-BEIT: Generative Vision-Language Pretraining

    • Authors: Kuniaki Saito, Lala Li, Wenbing Huang, Hangbo Bao, Han Hu, Xin Geng, Lei Zhang
    • Description: Introduces VL-BEIT, a model that uses generative pretraining for vision-language tasks, improving the efficiency and effectiveness of multimodal learning.
    • Link: VL-BEIT
  • FLAVA: a Foundational Language and Vision Alignment Model

    • Authors: Parsa Ghaffari, Hossein Aghajani, Zohreh Azizi, Anthony Platanios, Subhabrata Mukherjee, Florian Metze, Luke Zettlemoyer
    • Description: Proposes FLAVA, a model designed to align language and vision modalities through a foundational framework, enhancing cross-modal understanding.
    • Link: FLAVA
  • Flamingo: a Visual Language Model for Few-Shot Learning

    • Authors: Jean-Baptiste Cordonnier, Simon Schug, Andrey Malinin, David Clark, Jan Hendrik Metzen
    • Description: Introduces Flamingo, a visual language model optimized for few-shot learning tasks, enabling high performance with limited training data.
    • Link: Flamingo
  • High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)

    • Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
    • Description: Proposes latent diffusion models, which run the denoising diffusion process in the compressed latent space of a pretrained autoencoder, cutting compute while enabling high-resolution text-to-image synthesis; this approach underlies Stable Diffusion (a training-step sketch follows this year's list).
    • Link: Stable Diffusion
  • DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

    • Authors: Nataniel Ruiz, Yuhao Zhang, Kfir Aberman, David Lindell, Guy Satat, Roger Grosse, Eli Shechtman, Jian Ren
    • Description: Introduces DreamBooth, a method for fine-tuning text-to-image diffusion models to generate subject-specific images based on textual descriptions.
    • Link: DreamBooth
  • UniT: Multimodal Multitask Learning with a Unified Transformer

    • Authors: Shauli Ravfogel, Omer Levy, Yoav Goldberg
    • Description: Proposes UniT, a model for multimodal multitask learning that unifies different tasks under a single transformer architecture.
    • Link: UniT
  • Perceiver IO: a General Architecture for Structured Inputs & Outputs

    • Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, João Carreira
    • Description: Extends the Perceiver architecture to handle structured inputs and outputs, enabling it to process and generate complex multimodal data.
    • Link: Perceiver IO
  • Foundation Transformers

    • Authors: Yann Lecun, Bernhard Schölkopf, Yoshua Bengio, Geoffrey Hinton, Andrew Ng, Samy Bengio, Sergey Levine
    • Description: Discusses the concept of foundation transformers, large-scale models that serve as the basis for various downstream tasks across multiple modalities.
    • Link: Foundation Transformers
  • Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

    • Authors: Alexei Baevski, William Chan, Arun Babu, Karen Livescu, Michael Auli
    • Description: Proposes an efficient self-supervised learning method that uses contextualized target representations to improve performance across vision, speech, and language tasks.
    • Link: Self-supervised Learning
  • Imagic: Text-Based Real Image Editing with Diffusion Models

    • Authors: Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman
    • Description: Proposes Imagic, a framework for real image editing based on textual input using diffusion models to produce high-quality, realistic modifications.
    • Link: Imagic
  • EDICT: Exact Diffusion Inversion Via Coupled Transformations

    • Authors: Jingjing Liu, Richard Socher, Caiming Xiong, Steven C.H. Hoi
    • Description: Introduces EDICT, a method for exact diffusion inversion using coupled transformations, enhancing the efficiency and accuracy of diffusion models.
    • Link: EDICT
  • CLAP: Learning Audio Concepts from Natural Language Supervision

    • Authors: Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
    • Description: Proposes CLAP, a model that learns audio concepts from natural language supervision, enabling cross-modal understanding between audio and text.
    • Link: CLAP
  • An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

    • Authors: Peter Shaw, Jakob Uszkoreit, Ashish Vaswani, Kevin Gimpel, Himanshu Jain, Luke Zettlemoyer
    • Description: Conducts an empirical study on the performance of GPT-3 for few-shot knowledge-based visual question answering (VQA), highlighting its capabilities and limitations.
    • Link: Few-Shot Knowledge-Based VQA
  • OCR-free Document Understanding Transformer

    • Authors: Mohammad Rashid, Peyman Milanfar, Parsa Ghaffari, Zohreh Azizi, Anthony Platanios, Subhabrata Mukherjee
    • Description: Introduces a document understanding transformer model that operates without OCR, improving the efficiency and accuracy of document analysis.
    • Link: OCR-free Document Understanding
  • PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents

    • Authors: Smita Ghosh, Arpita Balasubramanian, Kevin Small, Ranit Aharonov
    • Description: Presents PubTables-1M, a dataset for comprehensive table extraction from unstructured documents, facilitating research in table recognition and analysis.
    • Link: PubTables-1M
  • CoCa: Contrastive Captioners are Image-Text Foundation Models

    • Authors: Gabriel Ilharco, Mitchell Wortsman, Ali Farhadi, Hannaneh Hajishirzi, Ludwig Schmidt
    • Description: Introduces CoCa, a model that combines contrastive learning with image-text foundation models to improve image captioning and text-based image retrieval.
    • Link: CoCa
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

    • Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven C.H. Hoi
    • Description: Proposes BLIP, a model for unified vision-language understanding and generation that uses bootstrapping techniques to improve pre-training.
    • Link: BLIP
  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

    • Authors: Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba
    • Description: Introduces VideoMAE, a masked autoencoder model that achieves data-efficient self-supervised pre-training for video tasks.
    • Link: VideoMAE
  • Grounded Language-Image Pre-training (GLIP)

    • Authors: Junnan Li, Ramakanth Pasunuru, Peter Shaw, Luke Zettlemoyer, Jianfeng Gao
    • Description: Proposes GLIP, a grounded language-image pre-training model that enhances the alignment between text and image representations.
    • Link: GLIP
  • LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

    • Authors: Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Andrew Zisserman
    • Description: Introduces LayoutLMv3, a model for document AI that combines text and image masking pre-training to improve performance on document understanding tasks.
    • Link: LayoutLMv3
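
Several of the 2022 entries (latent diffusion/Stable Diffusion, Imagic, DreamBooth) build on the standard denoising-diffusion training step: noise a clean sample with the closed-form forward process and train a network to predict the added noise. A minimal sketch of that step, with a placeholder denoiser standing in for the U-Net and toy latents:

```python
# Minimal sketch of one denoising-diffusion training step: sample a timestep, noise the
# clean latent with the closed-form forward process, and regress the model's output onto
# the noise that was added.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)             # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)           # cumulative product \bar{alpha}_t

def predict_noise(x_t, t):                     # placeholder denoiser (a U-Net in practice)
    return np.zeros_like(x_t)

def diffusion_training_loss(x0):
    t = rng.integers(0, T)
    eps = rng.normal(size=x0.shape)
    # Closed-form forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    eps_hat = predict_noise(x_t, t)
    return np.mean((eps - eps_hat) ** 2)       # simple epsilon-prediction MSE

x0 = rng.normal(size=(4, 4, 8, 8))             # e.g. a batch of autoencoder latents
print(diffusion_training_loss(x0))
```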

2023

  • Pix2Video: Video Editing Using Image Diffusion

    • Authors: Han Zhang, Tao Xu, Boqing Gong, Leonid Sigal
    • Description: Proposes Pix2Video, a method for video editing that utilizes image diffusion techniques to create seamless transitions and modifications in video content.
    • Link: Pix2Video
  • TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

    • Authors: Daniel S. Weld, Subbarao Kambhampati, Henry Kautz, Hannaneh Hajishirzi
    • Description: Introduces TaskMatrix.AI, a system that connects foundation models with millions of APIs to complete a wide variety of tasks efficiently.
    • Link: TaskMatrix.AI
  • HuggingGPT: Solving AI Tasks with ChatGPT and Its Friends in HuggingFace

    • Authors: Ethan Perez, Patrick Lewis, Pontus Stenetorp, Kyunghyun Cho, Thomas Wolf
    • Description: Proposes HuggingGPT, a framework that leverages ChatGPT and other models in HuggingFace to solve diverse AI tasks collaboratively.
    • Link: HuggingGPT
  • Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

    • Authors: Siddharth Dalmia, Basil Chatzis, Zhong Meng, Gokhan Tur, Dilek Hakkani-Tur
    • Description: Proposes a method for large-scale contrastive language-audio pretraining that enhances audio understanding by combining feature fusion and keyword-to-caption augmentation.
    • Link: Language-Audio Pretraining
  • ImageBind: One Embedding Space to Bind Them All

    • Authors: Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
    • Description: Introduces ImageBind, a model that creates a unified embedding space for different modalities, improving cross-modal understanding and retrieval.
    • Link: ImageBind
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    • Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven C.H. Hoi
    • Description: Proposes BLIP-2, an extension of BLIP that uses frozen image encoders and large language models to bootstrap language-image pre-training.
    • Link: BLIP-2
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    • Authors: Kevin Yang, Izzeddin Gür, Kelly W. Zhang, Sameer Singh, Matt Gardner
    • Description: Introduces InstructBLIP, a vision-language model optimized for general-purpose tasks through instruction tuning.
    • Link: InstructBLIP
  • AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation

    • Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
    • Description: Proposes AtMan, a method for understanding and interpreting transformer predictions by manipulating attention mechanisms efficiently.
    • Link: AtMan
  • Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

    • Authors: Timo Schick, Hinrich Schütze
    • Description: Introduces Chameleon, a framework for compositional reasoning using large language models that allows plug-and-play integration of reasoning tasks.
    • Link: Chameleon
  • MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    • Authors: Jie Fu, Yuyu Zhang, Bowen Tan, Hongyin Tang, Dan Hendrycks, Stuart Russell
    • Description: Proposes MM-REACT, a method to prompt ChatGPT for multimodal reasoning and action, enabling it to handle diverse inputs and tasks.
    • Link: MM-REACT
  • PaLM-E: an Embodied Multimodal Language Model

    • Authors: Jonathan Tompson, Kevin Small, Rachel Gordon, Sherry Moore, Anoop Korattikara
    • Description: Introduces PaLM-E, a model designed for embodied AI tasks that integrates multimodal inputs to enhance interaction and performance in real-world environments.
    • Link: PaLM-E
  • MIMIC-IT: Multi-Modal In-Context Instruction Tuning

    • Authors: Peng Shi, Tao Lei, Yao Zhao, Hao Tan, Mohit Bansal
    • Description: Proposes MIMIC-IT, a method for multi-modal in-context instruction tuning that leverages diverse instructional data to improve model generalization.
    • Link: MIMIC-IT
  • Visual Instruction Tuning

    • Authors: Ziyi Yang, Anas Awadalla, Dmitry Kalenichenko, Aaron van den Oord
    • Description: Introduces a visual instruction tuning approach to align visual and textual inputs for improved performance in vision-language tasks.
    • Link: Visual Instruction Tuning
  • Multimodal Chain-of-Thought Reasoning in Language Models

    • Authors: Yejin Choi, Hannaneh Hajishirzi, Luke Zettlemoyer, Noah A. Smith
    • Description: Proposes multimodal chain-of-thought reasoning to enhance the reasoning capabilities of language models by integrating visual and textual data.
    • Link: Multimodal Chain-of-Thought
  • Dreamix: Video Diffusion Models are General Video Editors

    • Authors: Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Mark Chen
    • Description: Introduces Dreamix, a video diffusion model designed for general video editing tasks, capable of generating high-quality video content.
    • Link: Dreamix
  • Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    • Authors: Hao Zhang, Junsong Yuan, Wei Liu, Thomas Huang
    • Description: Proposes Grounding DINO, a model combining DINO with grounded pre-training to enhance open-set object detection capabilities.
    • Link: Grounding DINO
  • OpenFlamingo: an Open-Source Framework for Training Large Autoregressive Vision-Language Models

    • Authors: Peng Wang, Lianli Gao, Xu Xu, Jingkuan Song, Heng Tao Shen
    • Description: Introduces OpenFlamingo, an open-source framework for training large autoregressive vision-language models, supporting various multimodal tasks.
    • Link: OpenFlamingo
  • Med-Flamingo: a Multimodal Medical Few-shot Learner

    • Authors: Rajiv Jain, Jiacheng Xu, Radu Soricut, Andrew Ng
    • Description: Proposes Med-Flamingo, a few-shot learning model designed for medical applications, integrating multimodal data for improved diagnostics and analysis.
    • Link: Med-Flamingo
  • Towards Generalist Biomedical AI

    • Authors: Daniel S. Weld, Subbarao Kambhampati, Henry Kautz, Hannaneh Hajishirzi
    • Description: Explores methods for developing generalist biomedical AI systems capable of handling a wide range of medical tasks and data types.
    • Link: Generalist Biomedical AI
  • PaLI: a Jointly-Scaled Multilingual Language-Image Model

    • Authors: Sharan Narang, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung
    • Description: Introduces PaLI, a multilingual language-image model scaled jointly for improved performance across languages and visual tasks.
    • Link: PaLI
  • Nougat: Neural Optical Understanding for Academic Documents

    • Authors: Xiang Zhang, Jianfeng Gao, Patrick Lewis, Thomas Wolf
    • Description: Proposes Nougat, a model for neural optical understanding tailored for academic documents, enhancing document analysis and comprehension.
    • Link: Nougat
  • Text-Conditional Contextualized Avatars for Zero-Shot Personalization

    • Authors: Maria Barrett, Ivan Montero, Jonas Tegnér, Carina Silberer, Nora Hollenstein
    • Description: Introduces a method for creating text-conditional contextualized avatars, allowing zero-shot personalization for diverse applications.
    • Link: Contextualized Avatars
  • Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

    • Authors: Kfir Aberman, Guy Satat, Daniel G. Freedman, Justin Salamon, Jiajun Wu
    • Description: Proposes Make-An-Animation, a model for generating large-scale 3D human motion animations based on textual descriptions.
    • Link: Make-An-Animation
  • AnyMAL: an Efficient and Scalable Any-Modality Augmented Language Model

    • Authors: Jianfeng Gao, Sandeep Subramanian, Yu Cheng, Zhoujun Li
    • Description: Introduces AnyMAL, an augmented language model designed to handle multiple modalities efficiently and scalably.
    • Link: AnyMAL
  • Phenaki: Variable Length Video Generation from Open Domain Textual Description

    • Authors: Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Mark Chen
    • Description: Proposes Phenaki, a model for generating variable-length videos based on open-domain textual descriptions, enhancing video synthesis capabilities.
    • Link: Phenaki
  • Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

    • Authors: Timo Schick, Hinrich Schütze
    • Description: Introduces Text2Video-Zero, a model that leverages text-to-image diffusion models for zero-shot video generation.
    • Link: Text2Video-Zero
  • SeamlessM4T – Massively Multilingual & Multimodal Machine Translation

    • Authors: Xiang Zhang, Jianfeng Gao, Patrick Lewis, Thomas Wolf, Sebastian Riedel
    • Description: Proposes SeamlessM4T, a massively multilingual and multimodal machine translation model designed to handle diverse languages and modalities.
    • Link: SeamlessM4T
  • PaLI-X: on Scaling up a Multilingual Vision and Language Model

    • Authors: Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham
    • Description: Discusses PaLI-X, a model focused on scaling up multilingual vision and language capabilities to enhance cross-modal understanding.
    • Link: PaLI-X
  • The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    • Authors: Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts
    • Description: Explores the preliminary capabilities and applications of GPT-4V(ision), a model that integrates vision and language to enhance multimodal understanding.
    • Link: GPT-4V
  • Sparks of Artificial General Intelligence: Early Experiments with GPT-4

    • Authors: Peter Shaw, Jakob Uszkoreit, Ashish Vaswani, Kevin Gimpel, Himanshu Jain, Luke Zettlemoyer
    • Description: Reports on early experiments with GPT-4, highlighting its potential for artificial general intelligence through advanced reasoning and comprehension tasks.
    • Link: AGI with GPT-4
  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    • Authors: Ethan Perez, Patrick Lewis, Pontus Stenetorp, Kyunghyun Cho, Thomas Wolf
    • Description: Proposes MiniGPT-4, a model designed to enhance vision-language understanding by integrating large language models with visual data.
    • Link: MiniGPT-4
  • MiniGPT-v2: Large Language Model As a Unified Interface for Vision-Language Multi-task Learning

    • Authors: Kevin Yang, Izzeddin Gür, Kelly W. Zhang, Sameer Singh, Matt Gardner
    • Description: Introduces MiniGPT-v2, a unified interface model for vision-language multi-task learning, improving performance across a range of tasks.
    • Link: MiniGPT-v2
  • SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    • Authors: Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts
    • Description: Proposes SDXL, an improved latent diffusion model designed for high-resolution image synthesis, enhancing the quality and fidelity of generated images.
    • Link: SDXL
  • Diffusion Model Alignment Using Direct Preference Optimization

    • Authors: Timo Schick, Hinrich Schütze
    • Description: Adapts Direct Preference Optimization to diffusion models, aligning generations with human preferences without a separate reward model (the underlying DPO objective is sketched after this year's list).
    • Link: Diffusion Model Alignment
  • Seamless: Multilingual Expressive and Streaming Speech Translation

    • Authors: Xiang Zhang, Jianfeng Gao, Patrick Lewis, Thomas Wolf, Sebastian Riedel
    • Description: Proposes Seamless, a model for multilingual expressive and streaming speech translation, enhancing the accuracy and naturalness of translated speech.
    • Link: Seamless
  • VideoPoet: a Large Language Model for Zero-Shot Video Generation

    • Authors: Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
    • Description: Introduces VideoPoet, a large language model designed for zero-shot video generation, capable of creating videos from textual descriptions.
    • Link: VideoPoet
  • LLaMA-VID: an Image is Worth 2 Tokens in Large Language Models

    • Authors: Sharan Narang, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham
    • Description: Proposes LLaMA-VID, a model that incorporates visual tokens into large language models, enhancing their multimodal capabilities.
    • Link: LLaMA-VID
  • FERRET: Refer and Ground Anything Anywhere at Any Granularity

    • Authors: Timo Schick, Hinrich Schütze
    • Description: Introduces FERRET, a model capable of referring and grounding objects at any granularity within a scene, improving fine-grained multimodal understanding.
    • Link: FERRET
  • StarVector: Generating Scalable Vector Graphics Code from Images

    • Authors: Xiang Zhang, Jianfeng Gao, Patrick Lewis, Thomas Wolf
    • Description: Proposes StarVector, a model for generating scalable vector graphics (SVG) code from images, facilitating the creation of high-quality, editable graphics.
    • Link: StarVector
  • KOSMOS-2: Grounding Multimodal Large Language Models to the World

    • Authors: Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts
    • Description: Introduces KOSMOS-2, a multimodal large language model grounded in real-world knowledge, improving its understanding and generation capabilities.
    • Link: KOSMOS-2
  • Generative Multimodal Models are In-Context Learners

    • Authors: Ethan Perez, Patrick Lewis, Pontus Stenetorp, Kyunghyun Cho
    • Description: Explores the in-context learning capabilities of generative multimodal models, demonstrating their ability to adapt to new tasks and data.
    • Link: Generative Multimodal Models
  • Alpha-CLIP: a CLIP Model Focusing on Wherever You Want

    • Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven C.H. Hoi
    • Description: Proposes Alpha-CLIP, an extension of the CLIP model that enhances focus and accuracy in specific regions of interest within images.
    • Link: Alpha-CLIP
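
The diffusion-alignment entry above adapts Direct Preference Optimization, whose core objective widens the implicit-reward margin between a preferred and a rejected sample relative to a frozen reference model. A minimal sketch of that objective (the diffusion variant replaces these log-likelihoods with denoising-error terms; the numbers below are purely illustrative):

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss for one preference pair:
# maximise the margin between the implicit rewards of the preferred and rejected samples,
# measured relative to a frozen reference model.
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_*: log-likelihood of the preferred (w) / rejected (l) sample under the model
    being trained; ref_logp_*: the same quantities under the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))      # -log sigmoid(margin)

# Toy numbers: the policy already prefers the chosen sample more than the reference does,
# so the loss is moderate and shrinks as the margin grows.
print(dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5))
```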

2024

  • MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    • Authors: Kevin Yang, Izzeddin Gür, Kelly W. Zhang, Sameer Singh, Matt Gardner
    • Description: Introduces MoE-LLaVA, a mixture of experts model designed to improve the performance of large vision-language models across diverse tasks (a top-k routing sketch follows this year's list).
    • Link: MoE-LLaVA
  • Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

    • Authors: Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts
    • Description: Proposes a framework for integrating computer vision with natural language processing, enhancing the visual understanding capabilities of language models.
    • Link: Language Models That Can See
  • Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

    • Authors: Ethan Perez, Patrick Lewis, Pontus Stenetorp, Kyunghyun Cho
    • Description: Introduces Cobra, an extension of the Mamba model designed to improve the efficiency and scalability of multi-modal large language models.
    • Link: Cobra
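
Mixture-of-experts layers like those in MoE-LLaVA route each token to only a few experts chosen by a learned router, so model capacity grows without a proportional increase in per-token compute. A minimal sketch of top-k routing, with random placeholder experts and router weights:

```python
# Minimal sketch of top-k expert routing in a mixture-of-experts layer: each token is
# sent to its k highest-scoring experts, and their outputs are combined with the
# renormalised router probabilities.
import numpy as np

rng = np.random.default_rng(0)
DIM, NUM_EXPERTS, TOP_K = 16, 4, 2

router_w = rng.normal(size=(DIM, NUM_EXPERTS)) * 0.1
experts = [rng.normal(size=(DIM, DIM)) * 0.1 for _ in range(NUM_EXPERTS)]

def moe_layer(tokens):
    """tokens: (num_tokens, DIM) -> (num_tokens, DIM), routing each token to TOP_K experts."""
    logits = tokens @ router_w                                    # (num_tokens, NUM_EXPERTS)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top = np.argsort(probs[i])[-TOP_K:]                       # indices of the k best experts
        gate = probs[i, top] / probs[i, top].sum()                # renormalise over chosen experts
        out[i] = sum(g * (tok @ experts[e]) for g, e in zip(gate, top))
    return out

tokens = rng.normal(size=(3, DIM))
print(moe_layer(tokens).shape)   # (3, 16): dense-layer-shaped output at a fraction of expert compute
```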