Papers, code, and resources about multi-modal dialogue, including methods, datasets, and related metrics.
We split the multi-modal dialogue task into Visual-Grounded Dialogue (VGD, including Visual QA, i.e. VQA), Visual Question Generation (VQG), Multimodal Conversation (MMC), and Visual Navigation (VN).
We roughly classify the learning paradigm of each method (where available) as either Fusion-Based (FB) or Attention-Based (AB):
- Fusion-Based (FB): simple concatenation of multi-modal information at the model input.
- Attention-Based (AB): co-attention between different modalities to learn their relations (see the sketch below).
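A minimal PyTorch sketch contrasting the two paradigms. The feature tensors, dimensions, and the single shared `co_attn` module are illustrative assumptions, not any specific paper's architecture:

```python
import torch
import torch.nn as nn

# Toy pre-extracted features (batch, sequence, hidden); purely illustrative.
text_feat = torch.randn(1, 16, 512)  # 16 text tokens
img_feat = torch.randn(1, 49, 512)   # 49 image regions

# Fusion-Based (FB): concatenate modalities at the model input and let
# a single encoder process the joint sequence.
encoder = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
fb_out = encoder(torch.cat([text_feat, img_feat], dim=1))  # (1, 65, 512)

# Attention-Based (AB): co-attention, each modality attending to the other
# to learn cross-modal relations (one shared module here for brevity;
# real models typically use separate, stacked cross-attention layers).
co_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
text_ctx, _ = co_attn(query=text_feat, key=img_feat, value=img_feat)
img_ctx, _ = co_attn(query=img_feat, key=text_feat, value=text_feat)
```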
Visual-grounded dialogue considers only one image per dialogue session; the whole session is grounded in this given image. It is also known as the Visual Dialog task.
Multi-modal conversation (MMC) aims at conducting conversations that involve multiple images. Models should understand several images and/or generate multi-modal responses during the conversation.
Title | Dataset Used | Publisher | Code | Class |
---|---|---|---|---|
Multimodal Dialogue Response Generation | PhotoChat; Reddit; YFCC100M | ACL 2022 | CODE | FB |
Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images | DailyDialog; Empathetic Dialog; PersonaChat; MS-COCO; Flickr30K | ACL 2021 | CODE | FB |
Towards Enriching Responses with Crowd-sourced Knowledge for Task-oriented Dialogue | MMConv | MuCAI 2021 | CODE | FB |
Multimodal Dialog System: Generating Responses via Adaptive Decoders | MMD | MM 2019 | CODE | FB |
Multimodal Dialog Systems via Capturing Context-aware Dependencies of Semantic Elements | MMD | MM 2020 | CODE | FB |
Multimodal Dialog System: Relational Graph-based Context-aware Question Understanding | MMD | MM 2021 | CODE | FB |
User Attention-guided Multimodal Dialog Systems | MMD | SIGIR 2019 | CODE | FB |
Text is NOT Enough: Integrating Visual Impressions into Open-domain Dialogue Generation | DailyDialog; Flickr30K; PersonaChat | MM 2021 | CODE | AB |
The visual question generation task produces questions, rather than responses, based on given images; it is closely related to visual-grounded dialogue.
Title | Dataset Used | Publisher | Code |
---|---|---|---|
Category-Based Strategy-Driven Question Generator for Visual Dialogue | GuessWhat?! | CCL 2021 | CODE |
Visual Dialogue State Tracking for Question Generation | GuessWhat?! | AAAI 2020 | CODE |
Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue | GuessWhat?!; MS-COCO | MM 2020 | CODE |
Goal-Oriented Visual Question Generation via Intermediate Rewards | GuessWhat?! | ECCV 2018 | CODE |
Learning Goal-Oriented Visual Dialog via Tempered Policy Gradient | GuessWhat?! | SLT 2018 | CODE |
Information Maximizing Visual Question Generation | VQG | CVPR 2019 | CODE |
Visual navigation focuses on guiding users from their starting points to their destinations given surrounding visual information. Here we mainly collect methods that involve conversational guidance in natural language.
A summary paper on visual dialogue metrics: A Revised Generative Evaluation of Visual Dialogue, CODE.
We split the related metrics into Rank-based and Generate-based:
- Rank-based: measures the quality of responses retrieved from a set of response candidates.
- Generate-based: measures the quality of responses generated by the model.
Metrics | Better indicator | Explanation |
---|---|---|
Mean | lower | mean rank of the ground-truth response among the candidates |
R@k | higher | ratio of ground-truth responses ranked within the top-k candidates |
Mean Reciprocal Rank (MRR) | higher | mean reciprocal rank of the ground-truth response in the ranked candidates |
Normalized Discounted Cumulative Gain@k (NDCG@k) | higher | discounted cumulative gain over a relevance list (0-1 scores assigned to the 100 candidate responses based on semantic similarity with the ground-truth responses), normalized by that of the ideal ranking |
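As a reference for how these rank-based scores are computed, here is a minimal NumPy sketch; the helper names and the 1-indexed-rank input convention are our assumptions:

```python
import numpy as np

def rank_metrics(gt_ranks, k=10):
    """Rank-based metrics, given the 1-indexed rank of the ground-truth
    response in each turn's candidate list (hypothetical helper)."""
    ranks = np.asarray(gt_ranks, dtype=float)
    return {
        "Mean": ranks.mean(),           # lower is better
        f"R@{k}": (ranks <= k).mean(),  # higher is better
        "MRR": (1.0 / ranks).mean(),    # higher is better
    }

def ndcg_at_k(relevance, k=10):
    """NDCG@k for one turn: `relevance` lists the 0-1 relevance scores of
    the candidates in ranked order (e.g. 100 candidates in VisDial)."""
    rel = np.asarray(relevance, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # 1/log2(rank+1)
    dcg = (rel * discounts).sum()
    # Ideal DCG: the same scores sorted into the best possible order.
    ideal = (np.sort(np.asarray(relevance, dtype=float))[::-1][:k] * discounts).sum()
    return dcg / ideal if ideal > 0 else 0.0

print(rank_metrics([1, 3, 12], k=10))  # Mean 5.33, R@10 0.67, MRR 0.47
print(ndcg_at_k([1.0, 0.0, 0.5, 0.8], k=4))
```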
Metrics | Better indicator | Explanation |
---|---|---|
BLEU-k | higher | k-gram precision overlap between the generated and ground-truth responses |
ROUGE-k/L | higher | k-gram / longest-common-subsequence recall overlap with the ground-truth responses |
METEOR | higher | unigram matching with stemming and synonym support |
CIDEr | higher | TF-IDF-weighted n-gram consensus between the generated and ground-truth responses |
Embedding Average | higher | cosine similarity between the mean word embeddings of the two responses |
Embedding Extrema | higher | cosine similarity between the per-dimension extrema of the word embeddings |
Embedding Greedy | higher | greedy word-level cosine matching, averaged over both directions |
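For the three embedding-based metrics, a minimal NumPy sketch; the function names are ours, and it assumes pretrained word vectors for each token are already available:

```python
import numpy as np

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def embedding_average(hyp_vecs, ref_vecs):
    """Cosine between the mean word vectors of hypothesis and reference."""
    return _cos(np.mean(hyp_vecs, axis=0), np.mean(ref_vecs, axis=0))

def embedding_extrema(hyp_vecs, ref_vecs):
    """Cosine between per-dimension extrema vectors (the value with the
    largest absolute magnitude in each dimension)."""
    def extrema(vecs):
        vecs = np.asarray(vecs)
        idx = np.argmax(np.abs(vecs), axis=0)
        return vecs[idx, np.arange(vecs.shape[1])]
    return _cos(extrema(hyp_vecs), extrema(ref_vecs))

def embedding_greedy(hyp_vecs, ref_vecs):
    """Greedy matching: each word's best cosine match in the other
    sequence, averaged, then averaged over both directions."""
    def one_way(a, b):
        return np.mean([max(_cos(x, y) for y in b) for x in a])
    return 0.5 * (one_way(hyp_vecs, ref_vecs) + one_way(ref_vecs, hyp_vecs))

# Toy 3-d "word vectors" for a 2-word hypothesis and a 3-word reference.
hyp = np.array([[0.1, 0.9, 0.0], [0.7, 0.2, 0.1]])
ref = np.array([[0.2, 0.8, 0.1], [0.6, 0.3, 0.0], [0.0, 1.0, 0.0]])
print(embedding_average(hyp, ref), embedding_extrema(hyp, ref), embedding_greedy(hyp, ref))
```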