Seminal Papers


Quick links



  • CIDEr: Consensus-based Image Description Evaluation
    • Authors: Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh
    • Description: Introduces CIDEr, a metric for evaluating image description quality by comparing generated captions with consensus descriptions from humans.
    • Link: CIDEr


  • “Why Should I Trust You?” Explaining the Predictions of Any Classifier

    • Authors: Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
    • Description: Proposes LIME, a method for explaining the predictions of any classifier by approximating it locally with an interpretable model.
    • Link: LIME
  • SPICE: Semantic Propositional Image Caption Evaluation

    • Authors: Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould
    • Description: Introduces SPICE, an evaluation metric for image captioning that assesses the quality of generated captions based on semantic content.
    • Link: SPICE


  • A Unified Approach to Interpreting Model Predictions

    • Authors: Scott M. Lundberg, Su-In Lee
    • Description: Proposes SHAP (SHapley Additive exPlanations), a unified framework for interpreting the output of machine learning models.
    • Link: SHAP
  • Mixup: Beyond Empirical Risk Minimization

    • Authors: Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz
    • Description: Introduces Mixup, a data augmentation technique that generates new training examples by interpolating between pairs of examples.
    • Link: Mixup
  • Multimodal Machine Learning: a Survey and Taxonomy

    • Authors: Tadas Baltrušaitis, Chaitanya Ahuja, Louis-Philippe Morency
    • Description: Provides a comprehensive survey and taxonomy of multimodal machine learning, covering various approaches and applications.
    • Link: Multimodal Machine Learning


  • Representation Learning with Contrastive Predictive Coding
    • Authors: Aaron van den Oord, Yazhe Li, Oriol Vinyals
    • Description: Proposes Contrastive Predictive Coding (CPC), a method for unsupervised representation learning by predicting future observations in a latent space.
    • Link: CPC


  • Modality Dropout for Improved Performance-driven Talking Faces

    • Authors: Ahmed Hussen Abdelaziz, Barry-John Theobald, Paul Dixon, Reinhard Knothe, Nicholas Apostoloff, Sachin Kajareker
    • Description: Introduces modality dropout, a technique for improving the performance of talking face generation models by dropping modalities during training.
    • Link: Modality Dropout
  • Augmentation Adversarial Training for Self-supervised Speaker Recognition

    • Authors: Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, Joon Son Chung
    • Description: Proposes augmentation adversarial training, which trains self-supervised speaker embeddings to discriminate between speakers while staying invariant to the augmentation applied to each utterance.
    • Link: Adversarial Training for Speaker Recognition
  • BERTScore: Evaluating Text Generation with BERT

    • Authors: Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
    • Description: Introduces BERTScore, an evaluation metric for text generation that uses BERT embeddings to compare generated text with references.
    • Link: BERTScore


  • Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models

    • Authors: Leah Nicolich-Henkin, Zach Trozenski, Joel Whiteman, Nathan Susanj
    • Description: Examines the impact of data augmentation and annotation standardization techniques on the performance of spoken language understanding models.
    • Link: Data Augmentation for SLU
  • Learning Transferable Visual Models from Natural Language Supervision

    • Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
    • Description: Introduces CLIP, a model that learns transferable visual representations from natural language supervision by training on a diverse set of image-text pairs.
    • Link: CLIP
  • Zero-Shot Text-to-Image Generation

    • Authors: Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever
    • Description: Proposes DALL-E, an autoregressive transformer that models text and image tokens as a single stream and generates images from text prompts in a zero-shot setting.
    • Link: DALL-E
  • ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

    • Authors: Wonjae Kim, Bokyung Son, Ildoo Kim
    • Description: Introduces ViLT, a vision-and-language transformer model that operates without convolutional layers or region supervision, simplifying the architecture while maintaining performance.
    • Link: ViLT
  • MLIM: Vision-and-language Model Pre-training with Masked Language and Image Modeling

    • Authors: Tarik Arici, Mehmet Saygin Seyfioglu, Tal Neiman, Yi Xu, Son Tran, Trishul Chilimbi, Belinda Zeng, Ismail Tutar
    • Description: Proposes MLIM, a model pre-training approach that combines masked language modeling with masked image modeling to enhance vision-and-language tasks.
    • Link: MLIM
  • MURAL: Multimodal, Multi-task Retrieval Across Languages

    • Authors: Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, Jason Baldridge
    • Description: Introduces MURAL, a retrieval model designed to handle multimodal and multi-task scenarios across multiple languages, enhancing cross-lingual and cross-modal retrieval.
    • Link: MURAL
  • Perceiver: General Perception with Iterative Attention

    • Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, João Carreira
    • Description: Proposes Perceiver, a model that uses iterative attention to handle diverse types of input data, generalizing across various perception tasks.
    • Link: Perceiver
  • Multimodal Few-Shot Learning with Frozen Language Models

    • Authors: Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill
    • Description: Introduces a method for multimodal few-shot learning that leverages frozen language models to achieve high performance with limited data.
    • Link: Multimodal Few-Shot Learning
  • On the Opportunities and Risks of Foundation Models

    • Authors: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, et al.
    • Description: Provides a comprehensive analysis of the opportunities and risks associated with foundation models, including their impact on various applications and ethical considerations.
    • Link: Foundation Models
  • CLIPScore: a Reference-free Evaluation Metric for Image Captioning

    • Authors: Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi
    • Description: Introduces CLIPScore, a reference-free evaluation metric for image captioning that leverages CLIP embeddings to assess the quality of generated captions based on their semantic content.
    • Link: CLIPScore
  • VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

    • Authors: Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer
    • Description: Proposes VideoCLIP, a model that uses contrastive pre-training to achieve zero-shot video-text understanding, enhancing the alignment between video and text representations.
    • Link: VideoCLIP


  • DeepNet: Scaling Transformers to 1,000 Layers

    • Authors: Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei
    • Description: Introduces DeepNorm, a modified residual connection with a matching initialization scheme, which stabilizes training well enough to scale Transformers to 1,000 layers.
    • Link: DeepNet
  • Data2vec: a General Framework for Self-supervised Learning in Speech, Vision and Language

    • Authors: Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli
    • Description: Proposes Data2vec, a unified framework for self-supervised learning across speech, vision, and language modalities, demonstrating its versatility and effectiveness.
    • Link: Data2vec
  • Hierarchical Text-Conditional Image Generation with CLIP Latents

    • Authors: Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
    • Description: Introduces a method for hierarchical text-conditional image generation using CLIP latents, improving the quality and coherence of generated images.
    • Link: Hierarchical Text-Conditional Image Generation
  • AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models

    • Authors: Xiaofan Zhang, Zongwei Zhou, Deming Chen, Yu Emma Wang
    • Description: Proposes AutoDistill, a framework for exploring and distilling language models to create hardware-efficient versions without sacrificing performance.
    • Link: AutoDistill
  • A Generalist Agent

    • Authors: Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, et al.
    • Description: Proposes a generalist agent capable of performing a wide range of tasks across different domains by leveraging a single, unified model.
    • Link: A Generalist Agent
  • Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

    • Authors: Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman
    • Description: Introduces Make-A-Scene, a text-to-image generation model that incorporates human priors to create more accurate and realistic scenes based on textual descriptions.
    • Link: Make-A-Scene
  • I-Code: an Integrative and Composable Multimodal Learning Framework

    • Authors: Ziyi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, et al.
    • Description: Proposes I-Code, a framework for integrating and composing multimodal learning tasks to improve performance and scalability.
    • Link: I-Code
  • VL-BEIT: Generative Vision-Language Pretraining

    • Authors: Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei
    • Description: Introduces VL-BEIT, a model that uses generative pretraining for vision-language tasks, improving the efficiency and effectiveness of multimodal learning.
    • Link: VL-BEIT
  • FLAVA: a Foundational Language and Vision Alignment Model

    • Authors: Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela
    • Description: Proposes FLAVA, a model designed to align language and vision modalities through a foundational framework, enhancing cross-modal understanding.
    • Link: FLAVA
  • Flamingo: a Visual Language Model for Few-Shot Learning

    • Authors: Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, et al.
    • Description: Introduces Flamingo, a visual language model optimized for few-shot learning tasks, enabling high performance with limited training data.
    • Link: Flamingo
  • Stable and Latent Diffusion Model

    • Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
    • Description: Proposes latent diffusion models, the basis of Stable Diffusion, which run the diffusion process in the latent space of a pretrained autoencoder to reduce the compute cost of high-resolution image synthesis.
    • Link: Stable Diffusion
  • DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

    • Authors: Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman
    • Description: Introduces DreamBooth, a method for fine-tuning text-to-image diffusion models to generate subject-specific images based on textual descriptions.
    • Link: DreamBooth
  • UniT: Multimodal Multitask Learning with a Unified Transformer

    • Authors: Ronghang Hu, Amanpreet Singh
    • Description: Proposes UniT, a model for multimodal multitask learning that unifies different tasks under a single transformer architecture.
    • Link: UniT
  • Perceiver IO: a General Architecture for Structured Inputs & Outputs

    • Authors: Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, et al.
    • Description: Extends the Perceiver architecture to handle structured inputs and outputs, enabling it to process and generate complex multimodal data.
    • Link: Perceiver IO
  • Foundation Transformers

    • Authors: Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, et al.
    • Description: Proposes Magneto, a Transformer variant with Sub-LayerNorm and a theoretically derived initialization, intended as a single general-purpose backbone for language, vision, speech, and multimodal tasks.
    • Link: Foundation Transformers
  • Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

    • Authors: Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli
    • Description: Proposes an efficient self-supervised learning method that uses contextualized target representations to improve performance across vision, speech, and language tasks.
    • Link: Self-supervised Learning
  • Imagic: Text-Based Real Image Editing with Diffusion Models

    • Authors: Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani
    • Description: Proposes Imagic, a framework for real image editing based on textual input using diffusion models to produce high-quality, realistic modifications.
    • Link: Imagic
  • EDICT: Exact Diffusion Inversion Via Coupled Transformations

    • Authors: Bram Wallace, Akash Gokul, Nikhil Naik
    • Description: Introduces EDICT, a method for exact diffusion inversion using coupled transformations, enhancing the efficiency and accuracy of diffusion models.
    • Link: EDICT
  • CLAP: Learning Audio Concepts from Natural Language Supervision

    • Authors: Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang
    • Description: Proposes CLAP, a model that learns audio concepts from natural language supervision, enabling cross-modal understanding between audio and text.
    • Link: CLAP
  • An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

    • Authors: Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
    • Description: Conducts an empirical study on the performance of GPT-3 for few-shot knowledge-based visual question answering (VQA), highlighting its capabilities and limitations.
    • Link: Few-Shot Knowledge-Based VQA
  • OCR-free Document Understanding Transformer

    • Authors: Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, et al.
    • Description: Introduces a document understanding transformer model that operates without OCR, improving the efficiency and accuracy of document analysis.
    • Link: OCR-free Document Understanding
  • PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents

    • Authors: Brandon Smock, Rohith Pesala, Robin Abraham
    • Description: Presents PubTables-1M, a dataset for comprehensive table extraction from unstructured documents, facilitating research in table recognition and analysis.
    • Link: PubTables-1M
  • CoCa: Contrastive Captioners are Image-Text Foundation Models

    • Authors: Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
    • Description: Introduces CoCa, an image-text encoder-decoder foundation model trained jointly with a contrastive loss and a captioning loss, so one model serves both retrieval-style and generation-style tasks.
    • Link: CoCa
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

    • Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven C.H. Hoi
    • Description: Proposes BLIP, a model for unified vision-language understanding and generation that uses bootstrapping techniques to improve pre-training.
    • Link: BLIP
  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

    • Authors: Zhan Tong, Yibing Song, Jue Wang, Limin Wang
    • Description: Introduces VideoMAE, a masked autoencoder model that achieves data-efficient self-supervised pre-training for video tasks.
    • Link: VideoMAE
  • Grounded Language-Image Pre-training (GLIP)

    • Authors: Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, et al.
    • Description: Proposes GLIP, a grounded language-image pre-training model that enhances the alignment between text and image representations.
    • Link: GLIP
  • LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

    • Authors: Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei
    • Description: Introduces LayoutLMv3, a model for document AI that combines text and image masking pre-training to improve performance on document understanding tasks.
    • Link: LayoutLMv3


  • Pix2Video: Video Editing Using Image Diffusion

    • Authors: Duygu Ceylan, Chun-Hao Paul Huang, Niloy J. Mitra
    • Description: Proposes Pix2Video, a method for video editing that utilizes image diffusion techniques to create seamless transitions and modifications in video content.
    • Link: Pix2Video
  • TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

    • Authors: Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, et al.
    • Description: Introduces TaskMatrix.AI, a system that connects foundation models with millions of APIs to complete a wide variety of tasks efficiently.
    • Link: TaskMatrix.AI
  • HuggingGPT: Solving AI Tasks with ChatGPT and Its Friends in HuggingFace

    • Authors: Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang
    • Description: Proposes HuggingGPT, a framework that leverages ChatGPT and other models in HuggingFace to solve diverse AI tasks collaboratively.
    • Link: HuggingGPT
  • Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

    • Authors: Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov
    • Description: Proposes a method for large-scale contrastive language-audio pretraining that enhances audio understanding by combining feature fusion and keyword-to-caption augmentation.
    • Link: Language-Audio Pretraining
  • ImageBind: One Embedding Space to Bind Them All

    • Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
    • Description: Introduces ImageBind, a model that creates a unified embedding space for different modalities, improving cross-modal understanding and retrieval.
    • Link: ImageBind
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    • Authors: Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
    • Description: Proposes BLIP-2, an extension of BLIP that uses frozen image encoders and large language models to bootstrap language-image pre-training.
    • Link: BLIP-2
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    • Authors: Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
    • Description: Introduces InstructBLIP, a vision-language model optimized for general-purpose tasks through instruction tuning.
    • Link: InstructBLIP
  • AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation

    • Authors: Björn Deiseroth, Mayukh Deb, Samuel Weinbach, Manuel Brack, Patrick Schramowski, Kristian Kersting
    • Description: Proposes AtMan, a method for understanding and interpreting transformer predictions by manipulating attention mechanisms efficiently.
    • Link: AtMan
  • Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

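    • Authors: Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Jianfeng Gao
    • Description: Introduces Chameleon, a plug-and-play framework in which a large language model plans and composes tools such as vision models, web search, and Python functions to answer complex reasoning questions.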
    • Link: Chameleon
  • MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    • Authors: Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, et al.
    • Description: Proposes MM-REACT, a method to prompt ChatGPT for multimodal reasoning and action, enabling it to handle diverse inputs and tasks.
    • Link: MM-REACT
  • PaLM-E: an Embodied Multimodal Language Model

    • Authors: Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, et al.
    • Description: Introduces PaLM-E, a model designed for embodied AI tasks that integrates multimodal inputs to enhance interaction and performance in real-world environments.
    • Link: PaLM-E
  • MIMIC-IT: Multi-Modal In-Context Instruction Tuning

    • Authors: Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu
    • Description: Proposes MIMIC-IT, a method for multi-modal in-context instruction tuning that leverages diverse instructional data to improve model generalization.
    • Link: MIMIC-IT
  • Visual Instruction Tuning

    • Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
    • Description: Introduces LLaVA, a vision-language model trained by instruction tuning on multimodal instruction-following data generated with language-only GPT-4.
    • Link: Visual Instruction Tuning
  • Multimodal Chain-of-Thought Reasoning in Language Models

    • Authors: Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola
    • Description: Proposes multimodal chain-of-thought reasoning to enhance the reasoning capabilities of language models by integrating visual and textual data.
    • Link: Multimodal Chain-of-Thought
  • Dreamix: Video Diffusion Models are General Video Editors

    • Authors: Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, Yedid Hoshen
    • Description: Introduces Dreamix, a video diffusion model designed for general video editing tasks, capable of generating high-quality video content.
    • Link: Dreamix
  • Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    • Authors: Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, et al.
    • Description: Proposes Grounding DINO, a model combining DINO with grounded pre-training to enhance open-set object detection capabilities.
    • Link: Grounding DINO
  • OpenFlamingo: an Open-Source Framework for Training Large Autoregressive Vision-Language Models

    • Authors: Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, et al.
    • Description: Introduces OpenFlamingo, an open-source framework for training large autoregressive vision-language models, supporting various multimodal tasks.
    • Link: OpenFlamingo
  • Med-Flamingo: a Multimodal Medical Few-shot Learner

    • Authors: Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka, Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, Jure Leskovec
    • Description: Proposes Med-Flamingo, a few-shot learning model designed for medical applications, integrating multimodal data for improved diagnostics and analysis.
    • Link: Med-Flamingo
  • Towards Generalist Biomedical AI

    • Authors: Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, et al.
    • Description: Explores methods for developing generalist biomedical AI systems capable of handling a wide range of medical tasks and data types.
    • Link: Generalist Biomedical AI
  • PaLI: a Jointly-Scaled Multilingual Language-Image Model

    • Authors: Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, et al.
    • Description: Introduces PaLI, a multilingual language-image model scaled jointly for improved performance across languages and visual tasks.
    • Link: PaLI
  • Nougat: Neural Optical Understanding for Academic Documents

    • Authors: Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic
    • Description: Proposes Nougat, a model for neural optical understanding tailored for academic documents, enhancing document analysis and comprehension.
    • Link: Nougat
  • Text-Conditional Contextualized Avatars for Zero-Shot Personalization

    • Authors: Samaneh Azadi, Thomas Hayes, Akbar Shah, Guan Pang, Devi Parikh, Sonal Gupta
    • Description: Introduces a method for creating text-conditional contextualized avatars, allowing zero-shot personalization for diverse applications.
    • Link: Contextualized Avatars
  • Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

    • Authors: Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, Sonal Gupta
    • Description: Proposes Make-An-Animation, a model for generating large-scale 3D human motion animations based on textual descriptions.
    • Link: Make-An-Animation
  • AnyMAL: an Efficient and Scalable Any-Modality Augmented Language Model

    • Authors: Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, et al.
    • Description: Introduces AnyMAL, an augmented language model designed to handle multiple modalities efficiently and scalably.
    • Link: AnyMAL
  • Phenaki: Variable Length Video Generation from Open Domain Textual Description

    • Authors: Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, Dumitru Erhan
    • Description: Proposes Phenaki, a model for generating variable-length videos based on open-domain textual descriptions, enhancing video synthesis capabilities.
    • Link: Phenaki
  • Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

    • Authors: Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi
    • Description: Introduces Text2Video-Zero, a model that leverages text-to-image diffusion models for zero-shot video generation.
    • Link: Text2Video-Zero
  • SeamlessM4T – Massively Multilingual & Multimodal Machine Translation

    • Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, et al.
    • Description: Proposes SeamlessM4T, a massively multilingual and multimodal machine translation model designed to handle diverse languages and modalities.
    • Link: SeamlessM4T
  • PaLI-X: on Scaling up a Multilingual Vision and Language Model

    • Authors: Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, et al.
    • Description: Discusses PaLI-X, a model focused on scaling up multilingual vision and language capabilities to enhance cross-modal understanding.
    • Link: PaLI-X
  • The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    • Authors: Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
    • Description: Explores the preliminary capabilities and applications of GPT-4V(ision), a model that integrates vision and language to enhance multimodal understanding.
    • Link: GPT-4V
  • Sparks of Artificial General Intelligence: Early Experiments with GPT-4

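    • Authors: Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang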
    • Description: Reports on early experiments with GPT-4, highlighting its potential for artificial general intelligence through advanced reasoning and comprehension tasks.
    • Link: AGI with GPT-4
  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    • Authors: Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
    • Description: Proposes MiniGPT-4, a model designed to enhance vision-language understanding by integrating large language models with visual data.
    • Link: MiniGPT-4
  • MiniGPT-v2: Large Language Model As a Unified Interface for Vision-Language Multi-task Learning

    • Authors: Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny
    • Description: Introduces MiniGPT-v2, a unified interface model for vision-language multi-task learning, improving performance across a range of tasks.
    • Link: MiniGPT-v2
  • SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    • Authors: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach
    • Description: Proposes SDXL, an improved latent diffusion model designed for high-resolution image synthesis, enhancing the quality and fidelity of generated images.
    • Link: SDXL
  • Diffusion Model Alignment Using Direct Preference Optimization

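    • Authors: Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik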
    • Description: Introduces a method for aligning diffusion models using direct preference optimization, improving their alignment with human preferences.
    • Link: Diffusion Model Alignment
  • Seamless: Multilingual Expressive and Streaming Speech Translation

    • Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, et al.
    • Description: Proposes Seamless, a model for multilingual expressive and streaming speech translation, enhancing the accuracy and naturalness of translated speech.
    • Link: Seamless
  • VideoPoet: a Large Language Model for Zero-Shot Video Generation

    • Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, et al.
    • Description: Introduces VideoPoet, a large language model designed for zero-shot video generation, capable of creating videos from textual descriptions.
    • Link: VideoPoet
  • LLaMA-VID: an Image is Worth 2 Tokens in Large Language Models

    • Authors: Yanwei Li, Chengyao Wang, Jiaya Jia
    • Description: Proposes LLaMA-VID, which represents each video frame with just two tokens, a context token and a content token, so that large language models can process long videos.
    • Link: LLaMA-VID
  • FERRET: Refer and Ground Anything Anywhere at Any Granularity

    • Authors: Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang
    • Description: Introduces FERRET, a model capable of referring and grounding objects at any granularity within a scene, improving fine-grained multimodal understanding.
    • Link: FERRET
  • StarVector: Generating Scalable Vector Graphics Code from Images

    • Authors: Juan A. Rodriguez, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, David Vazquez, Christopher Pal, Marco Pedersoli
    • Description: Proposes StarVector, a model for generating scalable vector graphics (SVG) code from images, facilitating the creation of high-quality, editable graphics.
    • Link: StarVector
  • KOSMOS-2: Grounding Multimodal Large Language Models to the World

    • Authors: Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei
    • Description: Introduces KOSMOS-2, a multimodal large language model that grounds text in the visual world by linking phrases to image regions expressed as bounding boxes.
    • Link: KOSMOS-2
  • Generative Multimodal Models are In-Context Learners

    • Authors: Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, et al.
    • Description: Explores the in-context learning capabilities of generative multimodal models, demonstrating their ability to adapt to new tasks and data.
    • Link: Generative Multimodal Models
  • Alpha-CLIP: a CLIP Model Focusing on Wherever You Want

    • Authors: Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
    • Description: Proposes Alpha-CLIP, an extension of the CLIP model that enhances focus and accuracy in specific regions of interest within images.
    • Link: Alpha-CLIP


  • MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    • Authors: Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, et al.
    • Description: Introduces MoE-LLaVA, a mixture of experts model designed to improve the performance of large vision-language models across diverse tasks.
    • Link: MoE-LLaVA
  • Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

    • Authors: William Berrios, Gautam Mittal, Tristan Thrush, Douwe Kiela, Amanpreet Singh
    • Description: Proposes LENS, a modular approach in which independent vision modules produce textual descriptions of an image and a frozen large language model reasons over that text to solve vision tasks.
    • Link: Language Models That Can See
  • Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

    • Authors: Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang
    • Description: Introduces Cobra, an extension of the Mamba model designed to improve the efficiency and scalability of multi-modal large language models.
    • Link: Cobra