We collect and classify evaluation methods for different dialog tasks, starting from 2012.
Tasks include:
- Open-domain Dialog
- Task-oriented Dialog
- Dialog Summarization
- Dialog Management
- Dialog State Tracking
- Dialog Policy
- Knowledge-grounded Dialog
- Conversational Search
- Conversational Recommendation
- Others
Modalities include:
- Text-based Dialog
- Speech-based Dialog
- Visual-based Dialog
- MultiModal-based Dialog
- Survey on evaluation methods for dialogue systems. Artificial Intelligence Review 2021
- Conversational Recommendation: Formulation, Methods, and Evaluation. SIGIR 2020
- A review of evaluation techniques for social dialogue systems. ISIAA@ICMI 2017
- A Comprehensive Assessment of Dialog Evaluation Metrics. CoRR 2021
- How to Evaluate Your Dialogue Models: A Review of Approaches. CoRR 2021
- MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation. AAAI
- Towards Fair Evaluation of Dialogue State Tracking by Flexible Incorporation of Turn-level Performances. ACL
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation. ACL
- DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations. ACL
- Probing the Robustness of Trained Metrics for Conversational Dialogue Systems. ACL
- Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking. ACL
- Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents. ConvAI@ACL
- Relevance in Dialogue: Is Less More? An Empirical Comparison of Existing Metrics, and a Novel Simple Metric. ConvAI@ACL 2022
- Doctor XAvIer: Explainable Diagnosis on Physician-Patient Dialogues and XAI Evaluation. BioNLP@ACL
- Open-Domain Dialog Evaluation Using Follow-Ups Likelihood. COLING
- Does GPT-3 Generate Empathetic Dialogues? A Novel In-Context Example Selection Method and Automatic Evaluation Metric for Empathetic Dialogue Generation. COLING
- SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation. COLING
- Integrating Pretrained Language Model for Dialogue Policy Evaluation. ICASSP
- A Dependency-Aware Utterances Permutation Strategy to Improve Conversational Evaluation. ECIR
- DialSummEval: Revisiting Summarization Evaluation for Dialogues. NAACL
- Explaining Dialogue Evaluation Metrics using Adversarial Behavioral Analysis. NAACL
- Long-term Control for Dialogue Generation: Methods and Evaluation. NAACL
- Generate, Evaluate, and Select: A Dialogue System with a Response Evaluator for Diversity-Aware Response Generation. NAACL-HLT (Student Research Workshop)
- MultiWOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation. SIGDIAL
- A Systematic Evaluation of Response Selection for Open Domain Dialogue. SIGDIAL
- Dialogue Evaluation with Offline Reinforcement Learning. SIGDIAL
- Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems. SIGDIAL
- Evaluation of Off-the-shelf Speech Recognizers on Different Accents in a Dialogue Domain. LREC
- Evaluating the Effects of Embedding with Speaker Identity Information in Dialogue Summarization. LREC
- Design and Evaluation of the Corpus of Everyday Japanese Conversation. LREC
- Evaluating Gender Bias in Film Dialogue. NLDB
- Statistical and clinical utility of multimodal dialogue-based speech and facial metrics for Parkinson's disease assessment. INTERSPEECH
- Which Model is Best: Comparing Methods and Metrics for Automatic Laughter Detection in a Naturalistic Conversational Dataset. INTERSPEECH
- Evaluation of call centre conversations based on a high-level symbolic representation. INTERSPEECH
- Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark. TACL
- A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents. IEEE Trans. Hum. Mach. Syst.
- Does Social Presence Increase Perceived Competence?: Evaluating Conversational Agents in Advice Giving Through a Video-Based Survey. Proc. ACM Hum. Comput. Interact.
- "I don't know what you mean by 'I am anxious'": A New Method for Evaluating Conversational Agent Responses to Standardized Mental Health Inputs for Anxiety and Depression. TIIS
- Ditch the Gold Standard: Re-evaluating Conversational Question Answering. ACL
- Evaluating the Cranfield Paradigm for Conversational Search Systems. ICTIR
- Evaluating Mixed-initiative Conversational Search Systems via User Simulation. WSDM
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows. CoRR
- Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges. CoRR
- MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue. CoRR
- Interactive Evaluation of Dialog Track at DSTC9. CoRR
- EnDex: Evaluation of Dialogue Engagingness at Scale. CoRR
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation. CoRR
- End-to-End Evaluation of a Spoken Dialogue System for Learning Basic Mathematics. CoRR
- Bipartite-play Dialogue Collection for Practical Automatic Evaluation of Dialogue Systems. CoRR
- CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation. CoRR
- Analyzing and Evaluating Faithfulness in Dialogue Summarization. CoRR
- ED-FAITH: Evaluating Dialogue Summarization on Faithfulness. CoRR
- INFACT: An Online Human Evaluation Framework for Conversational Recommendation. CoRR
- Evaluation of Automated Speech Recognition Systems for Conversational Speech: A Linguistic Perspective. CoRR
- Evaluating Data-Driven Co-Speech Gestures of Embodied Conversational Agents through Real-Time Interaction. CoRR
- Evaluating Conversational Recommender Systems. CoRR
- Conversation Graph: Data Augmentation, Training and Evaluation for Non-Deterministic Dialogue Management. TACL
- Meta-evaluation of Conversational Search Evaluation Metrics. TIS
- D-Score: Holistic Dialogue Evaluation Without Reference. TASLP
- How Am I Doing?: Evaluating Conversational Search Systems Offline. TIS
- Preserving Conversations with Contemporary Holocaust Witnesses: Evaluation of Interactions with a Digital 3D Testimony. CHI Extended Abstracts
- Heuristic Evaluation of Conversational Agents. CHI
- "How Robust R U?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations. ASRU
- POSSCORE: A Simple Yet Effective Evaluation of Conversational Search with Part of Speech Labelling. CIKM
- Evaluating Human-AI Hybrid Conversational Systems with Chatbot Message Suggestions. CIKM
- Enhancing the Open-Domain Dialogue Evaluation in Latent Space. ACL Findings
- RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models. ACL
- Towards a more Robust Evaluation for Conversational Question Answering. ACL
- Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation. ACL Findings
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation. ACL Findings
- What Did You Refer to? Evaluating Co-References in Dialogue. ACL Findings
- RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems. ACL
- A Human-machine Collaborative Framework for Evaluating Malevolence in Dialogues. ACL
- LEGOEval: An Open-Source Toolkit for Dialogue System Evaluation via Crowdsourcing. ACL demo
- Towards Quantifiable Dialogue Coherence Evaluation. ACL
- DynaEval: Unifying Turn and Dialogue Level Evaluation. ACL
- Hierarchical Dependence-aware Evaluation Measures for Conversational Search. SIGIR
- The Interplay of Task Success and Dialogue Quality: An in-depth Evaluation in Task-Oriented Visual Dialogues. EACL
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach. EMNLP
- $Q^2$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering. EMNLP
- NDH-Full: Learning and Evaluating Navigational Agents on Full-Length Dialogue. EMNLP
- Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions. EMNLP
- Large-Scale Quantitative Evaluation of Dialogue Agents' Response Strategies against Offensive Users. SIGDIAL
- How "open" are the conversations with open-domain chatbots? A proposal for Speech Event based evaluation. SIGDIAL
- Contrastive Response Pairs for Automatic Evaluation of Non-task-oriented Neural Conversational Models. SIGDIAL
- Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems. SIGIR
- Non-goal oriented dialogue agents: state of the art, dataset, and evaluation. Artif. Intell. Rev.
- An Evaluation of Chinese Human-Computer Dialogue Technology. Data Intell.
- CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers. ICLR
- WeChat AI's Submission for DSTC9 Interactive Dialogue Evaluation Track. CoRR
- On the Use of Linguistic Features for the Evaluation of Generative Dialogue Systems. CoRR
- Towards Quantifiable Dialogue Coherence Evaluation. CoRR
- Improving Computer Generated Dialog with Auxiliary Loss Functions and Custom Evaluation Metrics. CoRR
- Naturalness Evaluation of Natural Language Generation in Task-oriented Dialogues using BERT. CoRR
- Investigating the Impact of Pre-trained Language Models on Dialog Evaluation. CoRR
- Automatic Evaluation and Moderation of Open-domain Dialogue Systems. CoRR
- User Response and Sentiment Prediction for Automatic Dialogue Evaluation. CoRR
- Evaluate On-the-job Learning Dialogue Systems and a Case Study for Natural Language Understanding. CoRR
- Evaluating Predictive Uncertainty under Distributional Shift on Dialogue Dataset. CoRR
- Evaluating Pretrained Transformer Models for Entity Linking in Task-Oriented Dialog. CoRR
- A Conceptual Framework for Implicit Evaluation of Conversational Search Interfaces. CoRR
- An Automated Quality Evaluation Framework of Psychotherapy Conversations with Local Quality Estimates. CoRR
- Is my agent good enough? Evaluating Embodied Conversational Agents with Long and Short-term interactions. CoRR
- Evaluating Trust in the Context of Conversational Information Systems for new users of the Internet. CoRR
- Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining. TACL
- PONE: A Novel Automatic Evaluation Metric for Open-domain Generative Dialogue Systems. TIS
- How to Evaluate Single-Round Dialogues Like Humans: An Information-Oriented Metric. TASLP
- Predictive Engagement: An Efficient Metric for Automatic Evaluation of Open-Domain Dialogue Systems. AAAI
- Studying the Effects of Cognitive Biases in Evaluation of Conversational Agents. CHI
- A Conversational Agent to Improve Response Quality in Course Evaluations. CHI Extended Abstracts
- Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation. ACL
- USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. ACL
- Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation. ACL
- Can You Put it All Together: Evaluating Conversational Agents' Ability to Blend Skills. ACL
- Learning an Unreferenced Metric for Online Dialogue Evaluation. ACL
- Evaluating Dialogue Generation Systems via Response Selection. ACL
- Designing Precise and Robust Dialogue Response Evaluators. ACL
- uBLEU: Uncertainty-Aware Automatic Evaluation Method for Open-Domain Dialogue Systems. ACL student
- ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems. ACL demo
- Voiceai Systems to NIST Sre19 Evaluation: Robust Speaker Recognition on Conversational Telephone Speech. ICASSP
- Semantic Diversity for Natural Language Understanding Evaluation in Dialog Systems. COLING Industry
- Evaluating Cross-Lingual Transfer Learning Approaches in Multilingual Conversational Agent Models. COLING (Industry)
- Language Model Transformers as Evaluators for Open-domain Dialogues. COLING
- Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems. COLING
- A Comprehensive Evaluation of Incremental Speech Recognition and Diarization for Conversational AI. COLING
- Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems. EMNLP
- GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems. EMNLP
- Interactive Evaluation of Conversational Agents: Reflections on the Impact of Search Task Design. ICTIR
- Treating Dialogue Quality Evaluation as an Anomaly Detection Problem. LREC
- Evaluation of Off-the-shelf Speech Recognizers Across Diverse Dialogue Domains. LREC
- Evaluation of Argument Search Approaches in the Context of Argumentative Dialogue Systems. LREC
- Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols. SIGDIAL
- Unsupervised Evaluation of Interactive Dialog with DialoGPT. SIGDIAL
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation. SIGDIAL
- FinChat: Corpus and Evaluation Setup for Finnish Chat Conversations on Everyday Topics. INTERSPEECH
- Challenges in the Evaluation of Conversational Search Systems. Converse@KDD
- Evaluating Conversational Recommender Systems via User Simulation. KDD
- A Revised Generative Evaluation of Visual Dialogue. CoRR
- How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics. CoRR
- Turn-level Dialog Evaluation with Dialog-level Weak Signals for Bot-Human Hybrid Customer Service Systems. CoRR
- Submitting surveys via a conversational interface: an evaluation of user acceptance and approach effectiveness. CoRR
- An Evaluation Protocol for Generative Conversational Systems. CoRR
- SSA: A More Humanized Automatic Evaluation Method for Open Dialogue Generation. IJCNN
- Re-Evaluating ADEM: A Deeper Look at Scoring Dialogue Responses. AAAI
- Probabilistic-Logic Bots for Efficient Evaluation of Business Rules Using Conversational Interfaces. AAAI
- Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement. INLG
- Importance of Search and Evaluation Strategies in Neural Dialogue Modeling. INLG
- Towards Best Experiment Design for Evaluating Dialogue System Output. INLG
- Towards Coherent and Engaging Spoken Dialog Response Generation Using Automatic Conversation Evaluators. INLG
- Are the Tools up to the Task? an Evaluation of Commercial Dialog Tools in Developing Conversational Enterprise-grade Dialog Systems. NAACL-HLT
- Evaluating and Enhancing the Robustness of Dialogue Systems: A Case Study on a Negotiation Agent. NAACL-HLT
- Evaluating Coherence in Dialogue Systems using Entailment. NAACL-HLT
- Evaluating and Enhancing the Robustness of Retrieval-Based Dialogue Systems with Adversarial Examples. NLPCC
- Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems. NeurIPS
- Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References. SIGDIAL
- A Crowd-based Evaluation of Abuse Response Strategies in Conversational Agents. SIGDIAL
- User Evaluation of a Multi-dimensional Statistical Dialogue System. SIGDIAL
- Automatic evaluation of end-to-end dialog systems with adequacy-fluency metrics. Comput. Speech Lang.
- MusicBot: Evaluating Critiquing-Based Music Recommenders with Conversational Interaction. CIKM
- Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings. CoRR
- Domain-Independent turn-level Dialogue Quality Evaluation via User Satisfaction Estimation. CoRR
- ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons. CoRR
- How to Evaluate the Next System: Automatic Dialogue Evaluation from the Perspective of Continual Learning. CoRR
- Evaluating Older Users' Experiences with Commercial Dialogue Systems: Implications for Future Design and Development. CoRR
- Short Text Conversation Based on Deep Neural Network and Analysis on Evaluation Measure. CoRR
- SIMMC: Situated Interactive Multi-Modal Conversational Data Collection And Evaluation Platform. CoRR
- Multi-domain Conversation Quality Evaluation via User Satisfaction Estimation. CoRR
- RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems. AAAI
- Evaluation of Real-time Deep Learning Turn-taking Models for Multiple Dialogue Scenarios. ICMI
- One "Ruler" for All Languages: Multi-Lingual Dialogue Evaluation with Adversarial Multi-Task Learning. IJCAI
- Evaluating and Complementing Vision-to-Language Technology for People who are Blind with Conversational Crowdsourcing. IJCAI
- Adaboost with Auto-Evaluation for Conversational Models. IJCAI
- Towards a Structured Evaluation of Improv-bots: Improvisational Theatre as a Non-goal-driven Dialogue System. LaCATODA@IJCAI
- Expert Evaluation of a Spoken Dialogue System in a Clinical Operating Room. LREC
- EuroGames16: Evaluating Change Detection in Online Conversation. LREC
- LSDSCC: a Large Scale Domain-Specific Conversational Corpus for Response Generation with Diversity Oriented Evaluation Metrics. NAACL-HLT
- Empirical Evaluation of Character-Based Model on Neural Named-Entity Recognition in Indonesian Conversational Texts. NUT@EMNLP
- A Methodology for Evaluating Interaction Strategies of Task-Oriented Conversational Agents. SCAI@EMNLP
- Topic-based Evaluation for Conversational Bots. CoRR
- On Evaluating and Comparing Conversational Agents. CoRR
- Adversarial evaluation for open-domain dialogue generation. SIGDIAL Conference
- Evaluating Natural Language Understanding Services for Conversational Question Answering Systems. SIGDIAL Conference
- Generating and Evaluating Summaries for Partial Email Threads: Conversational Bayesian Surprise and Silver Standards. SIGDIAL Conference
- Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses. ACL
- Evaluating Persuasion Strategies and Deep Reinforcement Learning methods for Negotiation Dialogue agents. EACL
- Sherlock: Experimental Evaluation of a Conversational Agent for Mobile Information Tasks. IEEE Trans. Hum. Mach. Syst.
- Adversarial Evaluation of Dialogue Models. CoRR
- The First Evaluation of Chinese Human-Computer Dialogue Technology. CoRR
- Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation. CoRR
- Evaluating Quality of Chatbots and Intelligent Conversational Agents. CoRR
- Perspectives for Evaluating Conversational AI. CoRR
- Evaluating Visual Conversational Agents via Cooperative Human-AI Games. CoRR
- How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. EMNLP
- Evaluation Dataset (DT-Grade) and Word Weighting Approach towards Constructed Short Answers Assessment in Tutorial Dialogue Context. BEA@NAACL-HLT
- On the Evaluation of Dialogue Systems with Next Utterance Classification. SIGDIAL Conference
- The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics. LREC
- Automatic creation of scenarios for evaluating spoken dialogue systems via user-simulation. Knowl. Based Syst.
- Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems. ICLR (Poster)
- Interactive Topic Modeling for Exploring Asynchronous Online Conversations: Design and Evaluation of ConVisIT. TIS
- Evaluation of Crowdsourced User Input Data for Spoken Dialog Systems. SIGDIAL Conference
- Evaluating Spoken Dialogue Processing for Time-Offset Interaction. SIGDIAL Conference
- Query Refinement Using Conversational Context: A Method and an Evaluation Resource. NLDB
- Extrinsic Evaluation of Dialog State Tracking and Predictive Metrics for Dialog Policy Optimization. SIGDIAL Conference
- Evaluating a Spoken Dialogue System that Detects and Adapts to User Affective States. SIGDIAL Conference
- Evaluating coherence in open domain conversational systems. INTERSPEECH
- Modeling and evaluating dialog success in the LAST MINUTE corpus. LREC
- Japanese conversation corpus for training and evaluation of backchannel prediction model. LREC
- Network assisted rate adaptation for conversational video over LTE, concept and performance evaluation. CSWS@SIGCOMM
- Evaluation of a Conversation Management Toolkit for Multi Agent Programming. CoRR
- Development and evaluation of spoken dialog systems with one or two agents. INTERSPEECH
- Affective evaluation of multimodal dialogue games for preschoolers using physiological signals. INTERSPEECH
- Evaluating spoken dialogue models under the interactive pattern recognition framework. INTERSPEECH
- Evaluating an adaptive dialog system for the public. INTERSPEECH
- How Was Your Day? Evaluating a Conversational Companion. TAC
- In-Context Evaluation of Unsupervised Dialogue Act Models for Tutorial Dialogue. SIGDIAL Conference
- Evaluation of Speech Dialog Strategies for Internet Applications in the Car. SIGDIAL Conference
- Evaluating State Representations for Reinforcement Learning of Turn-Taking Policies in Tutorial Dialogue. SIGDIAL Conference
- Evaluating a City Exploration Dialogue System with Integrated Question-Answering and Pedestrian Navigation. ACL
- Implementation and evaluation of a multimodal addressee identification mechanism for multiparty conversation systems. ICMI
- Iterative Development and Evaluation of a Social Conversational Agent. IJCNLP
- An Automatic Dialog Simulation Technique to Develop and Evaluate Interactive Conversational Agents. Appl. Artif. Intell
- Practical Evaluation of Human and Synthesized Speech for Virtual Human Dialogue Systems. LREC
- Evaluation of Online Dialogue Policy Learning Techniques. LREC
- Resource Evaluation for Usable Speech Interfaces: Utilizing Human-Human Dialogue. LREC
- Evaluation of the KomParse Conversational Non-Player Characters in a Commercial Virtual World. LREC
- Evaluating expressive speech synthesis from audiobook corpora for conversational phrases. LREC
- Developing and evaluating an emergency scenario dialogue corpus. LREC
- Intrinsic and Extrinsic Evaluation of an Automatic User Disengagement Detector for an Uncertainty-Adaptive Spoken Dialogue System. HLT-NAACL
- Position Paper: Towards Standardized Metrics and Tools for Spoken and Multimodal Dialog System Evaluation. SDCTD@NAACL-HLT
- An End-to-End Evaluation of Two Situated Dialog Systems. SIGDIAL Conference
- Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system. EACL
- Topic identification based extrinsic evaluation of summarization techniques applied to conversational speech. ICASSP
- Conversational evaluation of artificial bandwidth extension of telephone speech using a mobile handset. ICASSP
- Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Commun.
- Conversational Evaluation of Speech Bandwidth Extension Using a Mobile Handset. IEEE Signal Process. Lett.
- Designing generalisation evaluation function through human-machine dialogue. CoRR
If you have any questions about the repository, or would like to add work on dialog evaluation, feel free to open an issue or email Peiyuan Gong (pygongnlp@gmail.com).