- 2015 (1 paper)
- 2016 (2 papers)
- 2017 (3 papers)
- 2019 (1 paper)
- 2020 (3 papers)
- 2021 (15 papers)
- 2022 (20 papers)
- 2023 (25 papers)
- 2024 (3 papers)
CIDEr: Consensus-based Image Description Evaluation
- Authors: Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh
- Description: Introduces CIDEr, a metric for evaluating image description quality by comparing generated captions with consensus descriptions from humans.
- Link: CIDEr
-
“Why Should I Trust You?” Explaining the Predictions of Any Classifier
- Authors: Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
- Description: Proposes LIME, a method for explaining the predictions of any classifier by approximating it locally with an interpretable model.
- Link: LIME
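LIME's local surrogate can be sketched in a few lines: sample perturbations of the input, weight them by proximity to the original, and fit a weighted linear model whose coefficients serve as the explanation. A minimal NumPy sketch (illustrative only; function name and kernel details are assumptions, not the `lime` package's API):

```python
import numpy as np

def lime_explain(predict_fn, x, n_samples=200, width=0.75, seed=0):
    """LIME-style sketch: perturb x by zeroing random feature subsets,
    weight samples by proximity, fit a weighted linear surrogate."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    masks = rng.integers(0, 2, size=(n_samples, d))   # which features survive
    samples = masks * x                               # zero out dropped features
    preds = np.array([predict_fn(s) for s in samples])
    dist = 1.0 - masks.mean(axis=1)                   # fraction of features dropped
    w = np.exp(-(dist ** 2) / width ** 2)             # proximity kernel
    # weighted least squares; coefficients are the feature importances
    A = np.c_[masks, np.ones(n_samples)] * np.sqrt(w)[:, None]
    b = preds * np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef[:d]
```

For a model that is already linear, the surrogate recovers the true coefficients exactly, which makes the behavior easy to sanity-check.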
-
SPICE: Semantic Propositional Image Caption Evaluation
- Authors: Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould
- Description: Introduces SPICE, an evaluation metric for image captioning that assesses the quality of generated captions based on semantic content.
- Link: SPICE
-
A Unified Approach to Interpreting Model Predictions
- Authors: Scott M. Lundberg, Su-In Lee
- Description: Proposes SHAP (SHapley Additive exPlanations), a unified framework for interpreting the output of machine learning models.
- Link: SHAP
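For a handful of features, the exact Shapley values that SHAP approximates can be computed by brute-force subset enumeration. A minimal NumPy sketch (illustrative only, not the `shap` library's API; absent features are filled in from a background value):

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values for small d: features outside a coalition S
    take their background value; phi_i averages f's marginal gains."""
    d = len(x)

    def value(S):
        z = background.astype(float).copy()
        idx = list(S)
        z[idx] = x[idx]                      # coalition features use x's values
        return f(z)

    phi = np.zeros(d)
    for i in range(d):
        rest = [j for j in range(d) if j != i]
        for k in range(d):                   # coalition sizes 0..d-1
            for S in combinations(rest, k):
                wgt = factorial(k) * factorial(d - k - 1) / factorial(d)
                phi[i] += wgt * (value(S + (i,)) - value(S))
    return phi
```

For a linear model, phi_i reduces to w_i * (x_i - background_i), a useful correctness check.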
-
Mixup: Beyond Empirical Risk Minimization
- Authors: Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz
- Description: Introduces Mixup, a data augmentation technique that generates new training examples by interpolating between pairs of examples.
- Link: Mixup
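The technique is two lines of arithmetic: draw an interpolation weight from a Beta(α, α) distribution and blend both the inputs and their one-hot labels. A NumPy sketch (the function name is illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two training examples; lam ~ Beta(alpha, alpha).
    Labels are expected as one-hot vectors."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y
```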
-
Multimodal Machine Learning: a Survey and Taxonomy
- Authors: Tadas Baltrušaitis, Chaitanya Ahuja, Louis-Philippe Morency
- Description: Provides a comprehensive survey and taxonomy of multimodal machine learning, covering various approaches and applications.
- Link: Multimodal Machine Learning
-
Representation Learning with Contrastive Predictive Coding
- Authors: Aaron van den Oord, Yazhe Li, Oriol Vinyals
- Description: Proposes Contrastive Predictive Coding (CPC), a method for unsupervised representation learning by predicting future observations in a latent space.
- Link: CPC
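CPC trains with the InfoNCE objective: the model must pick out the true future latent from a set of negatives, scored by dot products. A single-example NumPy sketch (illustrative, not the paper's implementation; the temperature is an assumption):

```python
import numpy as np

def info_nce(z_pred, z_pos, z_negs, temp=0.1):
    """InfoNCE: cross-entropy of classifying the positive z_pos
    against negatives, using dot-product scores."""
    scores = np.array([z_pred @ z_pos] + [z_pred @ n for n in z_negs]) / temp
    scores -= scores.max()                       # numerical stability
    log_prob = scores[0] - np.log(np.exp(scores).sum())
    return -log_prob                             # small when z_pos wins clearly
```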
-
Modality Dropout for Improved Performance-driven Talking Faces
- Authors: Ahmed Hussen Abdelaziz, Barry-John Theobald, et al. (Apple)
- Description: Introduces modality dropout, a technique for improving the performance of talking face generation models by dropping modalities during training.
- Link: Modality Dropout
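The core idea reduces to a few lines: with some probability, zero out an entire input stream during training so the model cannot over-rely on any one modality. A hedged NumPy sketch (the function name and the keep-at-least-one-modality guard are my assumptions, not the paper's exact scheme):

```python
import numpy as np

def modality_dropout(audio, video, p_drop=0.3, rng=None):
    """Randomly zero out a whole modality during training;
    never drops both streams at once."""
    rng = rng or np.random.default_rng()
    drop_a = rng.random() < p_drop
    drop_v = (rng.random() < p_drop) and not drop_a
    if drop_a:
        audio = np.zeros_like(audio)
    if drop_v:
        video = np.zeros_like(video)
    return audio, video
```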
-
Augmentation Adversarial Training for Self-supervised Speaker Recognition
- Authors: Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, Joon Son Chung
- Description: Proposes augmentation adversarial training to enhance self-supervised speaker recognition models by generating adversarial examples during training.
- Link: Adversarial Training for Speaker Recognition
-
BERTScore: Evaluating Text Generation with BERT
- Authors: Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
- Description: Introduces BERTScore, an evaluation metric for text generation that uses BERT embeddings to compare generated text with references.
- Link: BERTScore
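Given token embeddings, the matching step is simple: greedy max-cosine alignment in both directions, combined as an F1. A NumPy sketch on precomputed embeddings (illustrative; the real metric extracts contextual BERT embeddings and optionally applies IDF weighting):

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """BERTScore sketch: each token matches its highest-cosine
    counterpart; F1 of per-token precision and recall."""
    C = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    R = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = C @ R.T
    precision = sim.max(axis=1).mean()   # best reference match per candidate token
    recall = sim.max(axis=0).mean()      # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)
```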
-
Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models
- Authors: Genta Indra Winata, Samuel Cahyawijaya, Zhaojiang Lin, Peng Xu, Pascale Fung
- Description: Examines the impact of data augmentation and annotation standardization techniques on the performance of spoken language understanding models.
- Link: Data Augmentation for SLU
-
Learning Transferable Visual Models from Natural Language Supervision
- Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
- Description: Introduces CLIP, a model that learns transferable visual representations from natural language supervision by training on a diverse set of image-text pairs.
- Link: CLIP
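CLIP's training objective is a symmetric cross-entropy over the batch's image-text cosine-similarity matrix, where matched pairs lie on the diagonal. A minimal NumPy sketch of that loss (illustrative; real CLIP uses a learned temperature and very large batches):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temp=0.07):
    """Symmetric InfoNCE over a batch of (image, text) pairs."""
    I = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    T = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = I @ T.T / temp                      # cosine-similarity logits

    def xent_diag(M):
        # cross-entropy with the diagonal (matched pair) as the target
        M = M - M.max(axis=1, keepdims=True)
        logp = M - np.log(np.exp(M).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    return (xent_diag(logits) + xent_diag(logits.T)) / 2
```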
-
Zero-Shot Text-to-Image Generation
- Authors: Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever
- Description: Proposes DALL-E, an autoregressive transformer that generates images from textual descriptions, achieving strong zero-shot generation without task-specific fine-tuning.
- Link: DALL-E
-
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
- Authors: Wonjae Kim, Bokyung Son, Ildoo Kim
- Description: Introduces ViLT, a vision-and-language transformer model that operates without convolutional layers or region supervision, simplifying the architecture while maintaining performance.
- Link: ViLT
-
MLIM: Vision-and-language Model Pre-training with Masked Language and Image Modeling
- Authors: Tarik Arici, Mehmet Saygin Seyfioglu, et al. (Amazon)
- Description: Proposes MLIM, a model pre-training approach that combines masked language modeling with masked image modeling to enhance vision-and-language tasks.
- Link: MLIM
-
MURAL: Multimodal, Multi-task Retrieval Across Languages
- Authors: Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, Jason Baldridge
- Description: Introduces MURAL, a retrieval model designed to handle multimodal and multi-task scenarios across multiple languages, enhancing cross-lingual and cross-modal retrieval.
- Link: MURAL
-
Perceiver: General Perception with Iterative Attention
- Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, João Carreira
- Description: Proposes Perceiver, a model that uses iterative attention to handle diverse types of input data, generalizing across various perception tasks.
- Link: Perceiver
-
Multimodal Few-Shot Learning with Frozen Language Models
- Authors: Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill
- Description: Introduces a method for multimodal few-shot learning that leverages frozen language models to achieve high performance with limited data.
- Link: Multimodal Few-Shot Learning
-
On the Opportunities and Risks of Foundation Models
- Authors: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, et al. (Stanford Center for Research on Foundation Models; over 100 authors, including Percy Liang)
- Description: Provides a comprehensive analysis of the opportunities and risks associated with foundation models, including their impact on various applications and ethical considerations.
- Link: Foundation Models
-
CLIPScore: a Reference-free Evaluation Metric for Image Captioning
- Authors: Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi
- Description: Introduces CLIPScore, a reference-free evaluation metric for image captioning that leverages CLIP embeddings to assess the quality of generated captions based on their semantic content.
- Link: CLIPScore
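The metric itself is a one-liner over CLIP embeddings: a rescaled, floored cosine similarity, with no reference captions involved. A sketch (the rescaling weight w = 2.5 follows the paper; the embeddings here are stand-ins for real CLIP outputs):

```python
import numpy as np

def clip_score(image_emb, caption_emb, w=2.5):
    """CLIPScore sketch: rescaled, clipped cosine similarity between
    a CLIP image embedding and a CLIP caption embedding."""
    i = image_emb / np.linalg.norm(image_emb)
    c = caption_emb / np.linalg.norm(caption_emb)
    return w * max(float(i @ c), 0.0)
```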
-
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
- Authors: Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer
- Description: Proposes VideoCLIP, a model that uses contrastive pre-training to achieve zero-shot video-text understanding, enhancing the alignment between video and text representations.
- Link: VideoCLIP
-
DeepNet: Scaling Transformers to 1,000 Layers
- Authors: Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei
- Description: Introduces DeepNet, a technique for scaling transformers to 1,000 layers by stabilizing deep networks and improving training efficiency.
- Link: DeepNet
-
Data2vec: a General Framework for Self-supervised Learning in Speech, Vision and Language
- Authors: Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli
- Description: Proposes Data2vec, a unified framework for self-supervised learning across speech, vision, and language modalities, demonstrating its versatility and effectiveness.
- Link: Data2vec
-
Hierarchical Text-Conditional Image Generation with CLIP Latents
- Authors: Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
- Description: Introduces a method for hierarchical text-conditional image generation using CLIP latents, improving the quality and coherence of generated images.
- Link: Hierarchical Text-Conditional Image Generation
-
AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
- Authors: Xiaofan Zhang, Zongwei Zhou, Deming Chen, Yu Emma Wang
- Description: Proposes AutoDistill, a framework for exploring and distilling language models to create hardware-efficient versions without sacrificing performance.
- Link: AutoDistill
-
A Generalist Agent
- Authors: Scott Reed, Konrad Żołna, Emilio Parisotto, et al., Nando de Freitas (DeepMind)
- Description: Proposes a generalist agent capable of performing a wide range of tasks across different domains by leveraging a single, unified model.
- Link: A Generalist Agent
-
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
- Authors: Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman
- Description: Introduces Make-A-Scene, a text-to-image generation model that incorporates human priors to create more accurate and realistic scenes based on textual descriptions.
- Link: Make-A-Scene
-
I-Code: an Integrative and Composable Multimodal Learning Framework
- Authors: Ziyi Yang, Yuwei Fang, Chenguang Zhu, et al. (Microsoft)
- Description: Proposes I-Code, a framework for integrating and composing multimodal learning tasks to improve performance and scalability.
- Link: I-Code
-
VL-BEIT: Generative Vision-Language Pretraining
- Authors: Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei
- Description: Introduces VL-BEIT, a model that uses generative pretraining for vision-language tasks, improving the efficiency and effectiveness of multimodal learning.
- Link: VL-BEIT
-
FLAVA: a Foundational Language and Vision Alignment Model
- Authors: Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela
- Description: Proposes FLAVA, a model designed to align language and vision modalities through a foundational framework, enhancing cross-modal understanding.
- Link: FLAVA
-
Flamingo: a Visual Language Model for Few-Shot Learning
- Authors: Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, et al. (DeepMind)
- Description: Introduces Flamingo, a visual language model optimized for few-shot learning tasks, enabling high performance with limited training data.
- Link: Flamingo
-
High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)
- Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
- Description: Introduces latent diffusion models, which run the diffusion process in a compressed latent space learned by an autoencoder, making high-resolution image synthesis far more efficient; this work is the basis of Stable Diffusion.
- Link: Stable Diffusion
-
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
- Authors: Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman
- Description: Introduces DreamBooth, a method for fine-tuning text-to-image diffusion models to generate subject-specific images based on textual descriptions.
- Link: DreamBooth
-
UniT: Multimodal Multitask Learning with a Unified Transformer
- Authors: Ronghang Hu, Amanpreet Singh
- Description: Proposes UniT, a model for multimodal multitask learning that unifies different tasks under a single transformer architecture.
- Link: UniT
-
Perceiver IO: a General Architecture for Structured Inputs & Outputs
- Authors: Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, et al. (DeepMind)
- Description: Extends the Perceiver architecture to handle structured inputs and outputs, enabling it to process and generate complex multimodal data.
- Link: Perceiver IO
-
Foundation Transformers
- Authors: Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, et al. (Microsoft)
- Description: Discusses the concept of foundation transformers, large-scale models that serve as the basis for various downstream tasks across multiple modalities.
- Link: Foundation Transformers
-
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
- Authors: Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli
- Description: Proposes an efficient self-supervised learning method that uses contextualized target representations to improve performance across vision, speech, and language tasks.
- Link: Self-supervised Learning
-
Imagic: Text-Based Real Image Editing with Diffusion Models
- Authors: Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani
- Description: Proposes Imagic, a framework for real image editing based on textual input using diffusion models to produce high-quality, realistic modifications.
- Link: Imagic
-
EDICT: Exact Diffusion Inversion Via Coupled Transformations
- Authors: Bram Wallace, Akash Gokul, Nikhil Naik
- Description: Introduces EDICT, a method for exact diffusion inversion using coupled transformations, enhancing the efficiency and accuracy of diffusion models.
- Link: EDICT
-
CLAP: Learning Audio Concepts from Natural Language Supervision
- Authors: Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang
- Description: Proposes CLAP, a model that learns audio concepts from natural language supervision, enabling cross-modal understanding between audio and text.
- Link: CLAP
-
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
- Authors: Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
- Description: Conducts an empirical study on the performance of GPT-3 for few-shot knowledge-based visual question answering (VQA), highlighting its capabilities and limitations.
- Link: Few-Shot Knowledge-Based VQA
-
OCR-free Document Understanding Transformer
- Authors: Geewook Kim, Teakgyu Hong, Moonbin Yim, et al. (NAVER)
- Description: Introduces a document understanding transformer model that operates without OCR, improving the efficiency and accuracy of document analysis.
- Link: OCR-free Document Understanding
-
PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents
- Authors: Brandon Smock, Rohith Pesala, Robin Abraham
- Description: Presents PubTables-1M, a dataset for comprehensive table extraction from unstructured documents, facilitating research in table recognition and analysis.
- Link: PubTables-1M
-
CoCa: Contrastive Captioners are Image-Text Foundation Models
- Authors: Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
- Description: Introduces CoCa, a model that combines contrastive learning with image-text foundation models to improve image captioning and text-based image retrieval.
- Link: CoCa
-
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven C.H. Hoi
- Description: Proposes BLIP, a model for unified vision-language understanding and generation that uses bootstrapping techniques to improve pre-training.
- Link: BLIP
-
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
- Authors: Zhan Tong, Yibing Song, Jue Wang, Limin Wang
- Description: Introduces VideoMAE, a masked autoencoder model that achieves data-efficient self-supervised pre-training for video tasks.
- Link: VideoMAE
-
Grounded Language-Image Pre-training (GLIP)
- Authors: Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, et al.
- Description: Proposes GLIP, a grounded language-image pre-training model that enhances the alignment between text and image representations.
- Link: GLIP
-
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
- Authors: Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei
- Description: Introduces LayoutLMv3, a model for document AI that combines text and image masking pre-training to improve performance on document understanding tasks.
- Link: LayoutLMv3
-
Pix2Video: Video Editing Using Image Diffusion
- Authors: Duygu Ceylan, Chun-Hao Paul Huang, Niloy J. Mitra
- Description: Proposes Pix2Video, a method for video editing that utilizes image diffusion techniques to create seamless transitions and modifications in video content.
- Link: Pix2Video
-
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
- Authors: Yaobo Liang, Chenfei Wu, Ting Song, et al. (Microsoft)
- Description: Introduces TaskMatrix.AI, a system that connects foundation models with millions of APIs to complete a wide variety of tasks efficiently.
- Link: TaskMatrix.AI
-
HuggingGPT: Solving AI Tasks with ChatGPT and Its Friends in HuggingFace
- Authors: Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang
- Description: Proposes HuggingGPT, a framework that leverages ChatGPT and other models in HuggingFace to solve diverse AI tasks collaboratively.
- Link: HuggingGPT
-
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
- Authors: Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov
- Description: Proposes a method for large-scale contrastive language-audio pretraining that enhances audio understanding by combining feature fusion and keyword-to-caption augmentation.
- Link: Language-Audio Pretraining
-
ImageBind: One Embedding Space to Bind Them All
- Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
- Description: Introduces ImageBind, a model that creates a unified embedding space for different modalities, improving cross-modal understanding and retrieval.
- Link: ImageBind
-
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Authors: Junnan Li, Dongxu Li, Silvio Savarese, Steven C.H. Hoi
- Description: Proposes BLIP-2, an extension of BLIP that uses frozen image encoders and large language models to bootstrap language-image pre-training.
- Link: BLIP-2
-
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- Authors: Wenliang Dai, Junnan Li, Dongxu Li, et al. (Salesforce)
- Description: Introduces InstructBLIP, a vision-language model optimized for general-purpose tasks through instruction tuning.
- Link: InstructBLIP
-
AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation
- Authors: Björn Deiseroth, Mayukh Deb, Samuel Weinbach, et al. (Aleph Alpha)
- Description: Proposes AtMan, a method for understanding and interpreting transformer predictions by manipulating attention mechanisms efficiently.
- Link: AtMan
-
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
- Authors: Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Jianfeng Gao
- Description: Introduces Chameleon, a framework for compositional reasoning using large language models that allows plug-and-play integration of reasoning tasks.
- Link: Chameleon
-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
- Authors: Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, et al. (Microsoft)
- Description: Proposes MM-REACT, a method to prompt ChatGPT for multimodal reasoning and action, enabling it to handle diverse inputs and tasks.
- Link: MM-REACT
-
PaLM-E: an Embodied Multimodal Language Model
- Authors: Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, et al. (Google, TU Berlin)
- Description: Introduces PaLM-E, a model designed for embodied AI tasks that integrates multimodal inputs to enhance interaction and performance in real-world environments.
- Link: PaLM-E
-
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
- Authors: Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu
- Description: Proposes MIMIC-IT, a method for multi-modal in-context instruction tuning that leverages diverse instructional data to improve model generalization.
- Link: MIMIC-IT
-
Visual Instruction Tuning
- Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
- Description: Introduces LLaVA and the visual instruction tuning approach, training on machine-generated multimodal instruction-following data to align visual inputs with language instructions.
- Link: Visual Instruction Tuning
-
Multimodal Chain-of-Thought Reasoning in Language Models
- Authors: Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola
- Description: Proposes multimodal chain-of-thought reasoning to enhance the reasoning capabilities of language models by integrating visual and textual data.
- Link: Multimodal Chain-of-Thought
-
Dreamix: Video Diffusion Models are General Video Editors
- Authors: Eyal Molad, Eliahu Horwitz, Dani Valevski, et al. (Google)
- Description: Introduces Dreamix, a video diffusion model designed for general video editing tasks, capable of generating high-quality video content.
- Link: Dreamix
-
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
- Authors: Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, et al.
- Description: Proposes Grounding DINO, a model combining DINO with grounded pre-training to enhance open-set object detection capabilities.
- Link: Grounding DINO
-
OpenFlamingo: an Open-Source Framework for Training Large Autoregressive Vision-Language Models
- Authors: Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, et al.
- Description: Introduces OpenFlamingo, an open-source framework for training large autoregressive vision-language models, supporting various multimodal tasks.
- Link: OpenFlamingo
-
Med-Flamingo: a Multimodal Medical Few-shot Learner
- Authors: Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, et al.
- Description: Proposes Med-Flamingo, a few-shot learning model designed for medical applications, integrating multimodal data for improved diagnostics and analysis.
- Link: Med-Flamingo
-
Towards Generalist Biomedical AI
- Authors: Tao Tu, Shekoofeh Azizi, Danny Driess, et al. (Google Research, Google DeepMind)
- Description: Explores methods for developing generalist biomedical AI systems capable of handling a wide range of medical tasks and data types.
- Link: Generalist Biomedical AI
-
PaLI: a Jointly-Scaled Multilingual Language-Image Model
- Authors: Xi Chen, Xiao Wang, Soravit Changpinyo, et al. (Google Research)
- Description: Introduces PaLI, a multilingual language-image model scaled jointly for improved performance across languages and visual tasks.
- Link: PaLI
-
Nougat: Neural Optical Understanding for Academic Documents
- Authors: Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic
- Description: Proposes Nougat, a model for neural optical understanding tailored for academic documents, enhancing document analysis and comprehension.
- Link: Nougat
-
Text-Conditional Contextualized Avatars for Zero-Shot Personalization
- Authors: Maria Barrett, Ivan Montero, Jonas Tegnér, Carina Silberer, Nora Hollenstein
- Description: Introduces a method for creating text-conditional contextualized avatars, allowing zero-shot personalization for diverse applications.
- Link: Contextualized Avatars
-
Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation
- Authors: Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, Sonal Gupta (Meta)
- Description: Proposes Make-An-Animation, a model for generating large-scale 3D human motion animations based on textual descriptions.
- Link: Make-An-Animation
-
AnyMAL: an Efficient and Scalable Any-Modality Augmented Language Model
- Authors: Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, et al. (Meta)
- Description: Introduces AnyMAL, an augmented language model designed to handle multiple modalities efficiently and scalably.
- Link: AnyMAL
-
Phenaki: Variable Length Video Generation from Open Domain Textual Description
- Authors: Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, et al. (Google)
- Description: Proposes Phenaki, a model for generating variable-length videos based on open-domain textual descriptions, enhancing video synthesis capabilities.
- Link: Phenaki
-
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
- Authors: Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi
- Description: Introduces Text2Video-Zero, a model that leverages text-to-image diffusion models for zero-shot video generation.
- Link: Text2Video-Zero
-
SeamlessM4T – Massively Multilingual & Multimodal Machine Translation
- Authors: Seamless Communication team (Meta AI)
- Description: Proposes SeamlessM4T, a massively multilingual and multimodal machine translation model designed to handle diverse languages and modalities.
- Link: SeamlessM4T
-
PaLI-X: on Scaling up a Multilingual Vision and Language Model
- Authors: Xi Chen, Josip Djolonga, Piotr Padlewski, et al. (Google Research)
- Description: Discusses PaLI-X, a model focused on scaling up multilingual vision and language capabilities to enhance cross-modal understanding.
- Link: PaLI-X
-
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
- Authors: Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
- Description: Explores the preliminary capabilities and applications of GPT-4V(ision), a model that integrates vision and language to enhance multimodal understanding.
- Link: GPT-4V
-
Sparks of Artificial General Intelligence: Early Experiments with GPT-4
- Authors: Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, et al. (Microsoft Research)
- Description: Reports on early experiments with GPT-4, highlighting its potential for artificial general intelligence through advanced reasoning and comprehension tasks.
- Link: AGI with GPT-4
-
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Authors: Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
- Description: Proposes MiniGPT-4, a model designed to enhance vision-language understanding by integrating large language models with visual data.
- Link: MiniGPT-4
-
MiniGPT-v2: Large Language Model As a Unified Interface for Vision-Language Multi-task Learning
- Authors: Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, et al., Mohamed Elhoseiny
- Description: Introduces MiniGPT-v2, a unified interface model for vision-language multi-task learning, improving performance across a range of tasks.
- Link: MiniGPT-v2
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
- Authors: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach
- Description: Proposes SDXL, an improved latent diffusion model designed for high-resolution image synthesis, enhancing the quality and fidelity of generated images.
- Link: SDXL
-
Diffusion Model Alignment Using Direct Preference Optimization
- Authors: Bram Wallace, Meihua Dang, Rafael Rafailov, et al. (Salesforce, Stanford)
- Description: Introduces a method for aligning diffusion models using direct preference optimization, improving their alignment with human preferences.
- Link: Diffusion Model Alignment
-
Seamless: Multilingual Expressive and Streaming Speech Translation
- Authors: Seamless Communication team (Meta AI)
- Description: Proposes Seamless, a model for multilingual expressive and streaming speech translation, enhancing the accuracy and naturalness of translated speech.
- Link: Seamless
-
VideoPoet: a Large Language Model for Zero-Shot Video Generation
- Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, et al. (Google)
- Description: Introduces VideoPoet, a large language model designed for zero-shot video generation, capable of creating videos from textual descriptions.
- Link: VideoPoet
-
LLaMA-VID: an Image is Worth 2 Tokens in Large Language Models
- Authors: Yanwei Li, Chengyao Wang, Jiaya Jia
- Description: Proposes LLaMA-VID, a model that incorporates visual tokens into large language models, enhancing their multimodal capabilities.
- Link: LLaMA-VID
-
FERRET: Refer and Ground Anything Anywhere at Any Granularity
- Authors: Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, et al. (Apple, Columbia)
- Description: Introduces FERRET, a model capable of referring and grounding objects at any granularity within a scene, improving fine-grained multimodal understanding.
- Link: FERRET
-
StarVector: Generating Scalable Vector Graphics Code from Images
- Authors: Juan A. Rodriguez, Shubham Agarwal, Issam H. Laradji, et al. (ServiceNow Research)
- Description: Proposes StarVector, a model for generating scalable vector graphics (SVG) code from images, facilitating the creation of high-quality, editable graphics.
- Link: StarVector
-
KOSMOS-2: Grounding Multimodal Large Language Models to the World
- Authors: Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei
- Description: Introduces KOSMOS-2, a multimodal large language model grounded in real-world knowledge, improving its understanding and generation capabilities.
- Link: KOSMOS-2
-
Generative Multimodal Models are In-Context Learners
- Authors: Quan Sun, Yufeng Cui, Xiaosong Zhang, et al. (BAAI)
- Description: Explores the in-context learning capabilities of generative multimodal models, demonstrating their ability to adapt to new tasks and data.
- Link: Generative Multimodal Models
-
Alpha-CLIP: a CLIP Model Focusing on Wherever You Want
- Authors: Zeyi Sun, Ye Fang, Tong Wu, et al.
- Description: Proposes Alpha-CLIP, an extension of the CLIP model that enhances focus and accuracy in specific regions of interest within images.
- Link: Alpha-CLIP
-
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
- Authors: Bin Lin, Zhenyu Tang, Yang Ye, et al. (Peking University)
- Description: Introduces MoE-LLaVA, a mixture of experts model designed to improve the performance of large vision-language models across diverse tasks.
- Link: MoE-LLaVA
-
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
- Authors: William Berrios, Gautam Mittal, Tristan Thrush, Douwe Kiela, Amanpreet Singh
- Description: Proposes a framework for integrating computer vision with natural language processing, enhancing the visual understanding capabilities of language models.
- Link: Language Models That Can See
-
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
- Authors: Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang
- Description: Introduces Cobra, an extension of the Mamba model designed to improve the efficiency and scalability of multi-modal large language models.
- Link: Cobra