This repository hosts the code and pre-trained models for our paper FaithDial: A Faithful Benchmark for Information-Seeking Dialogue. It also hosts the data annotations for our NAACL 2022 paper "On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?" For more information, please visit the project page.
**************************** Updates ****************************
- 9/06: FaithDial accepted to TACL! Please check out the updated paper.
- 7/30: We released the code for FaithCritic and uploaded our model to 🤗 Hub.
- 4/25: We released the FaithDial paper and launched the project page. Check them out!
- 4/15: We released our paper, to appear at NAACL 2022!
The goal of information-seeking dialogue is to respond to user queries with natural language utterances that are grounded on knowledge sources. Dialogue systems, however, often hallucinate, i.e. generate unsupported utterances, as they amplify the noise found in existing training datasets. To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues. Annotators were asked to edit the hallucinated utterances in a pre-existing dataset to ensure they are faithful to knowledge sources and re-purpose the role of the interlocutor from a human wizard to a domain-expert bot.
The dataset is hosted on Hugging Face Datasets:
from datasets import load_dataset
dataset = load_dataset("McGill-NLP/FaithDial")
We'll release our fine-tuned models soon! Stay tuned!
The code for all the models in the paper is available in models, which can be used to reproduce our results or to train your own models.
First, install PyTorch 1.7+ from the official website. Then clone this repository and install the dependencies:
git clone git@github.com:McGill-NLP/FaithDial.git
cd FaithDial
pip install -r requirements.txt
Our code is tested with Python 3.8 and PyTorch 1.7.1 with CUDA 11.0.
By default, our code loads the data from Hugging Face Datasets. You can also provide your own data in the following format:
[
{
"utterances": [
... // prior utterances,
{
"history": [
"Have you ever been to a concert? They're so fun!",
"No I cannot as a bot. However, have you been to Madonna's? Her 10th concert was used to help her 13th album called \"Rebel Heart\".",
"Yeah I've heard of it but never went or what it was for. Can you tell me more about it?"
],
"speaker": "Wizard",
"knowledge": "It began on September 9, 2015, in Montreal, Canada, at the Bell Centre and concluded on March 20, 2016, in Sydney, Australia at Allphones Arena.",
"original_response": "It started in September of 2015 and ran all the way through March of 2016. Can you imagine being on the road that long?",
"response": "Sure. The concert started in September 9th of 2015 at Montreal, Canada. It continued till 20th of March of 2016, where it ended at Sydney, Australia.",
"BEGIN": [
"Hallucination",
"Entailment"
],
"VRM": [
"Disclosure",
"Question"
]
},
... // more utterances
]
},
... // more dialogues
]
In the above example, `original_response`, `BEGIN`, and `VRM` are optional and don't have to be provided for your own data.
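As a rough illustration of the format above, the following sketch builds a one-dialogue file and checks it before training. The helper `validate_dialogues` and the split into required and optional keys are assumptions made for this example (the text only states which three keys are optional); the repository's own loader may check different things.

```python
import json
import tempfile
from pathlib import Path

# Assumption: every key shown in the example except the three optional ones is required.
REQUIRED_KEYS = {"history", "speaker", "knowledge", "response"}
OPTIONAL_KEYS = {"original_response", "BEGIN", "VRM"}


def validate_dialogues(dialogues):
    """Return a list of problems found; an empty list means the data looks well-formed."""
    problems = []
    for d_idx, dialogue in enumerate(dialogues):
        utterances = dialogue.get("utterances")
        if not isinstance(utterances, list):
            problems.append(f"dialogue {d_idx}: 'utterances' must be a list")
            continue
        for u_idx, utterance in enumerate(utterances):
            missing = REQUIRED_KEYS - set(utterance)
            if missing:
                problems.append(
                    f"dialogue {d_idx}, utterance {u_idx}: missing {sorted(missing)}"
                )
            if not isinstance(utterance.get("history", []), list):
                problems.append(
                    f"dialogue {d_idx}, utterance {u_idx}: 'history' must be a list"
                )
    return problems


# One dialogue with one utterance, using text from the example above.
example = [
    {
        "utterances": [
            {
                "history": ["Have you ever been to a concert? They're so fun!"],
                "speaker": "Wizard",
                "knowledge": "It began on September 9, 2015, in Montreal, Canada, "
                "at the Bell Centre and concluded on March 20, 2016, "
                "in Sydney, Australia at Allphones Arena.",
                "response": "Sure. The concert started in September 9th of 2015 "
                "at Montreal, Canada.",
            }
        ]
    }
]

# Write the file to a temporary directory and validate what was written.
out_path = Path(tempfile.mkdtemp()) / "my_train.json"
out_path.write_text(json.dumps(example, indent=2), encoding="utf-8")
loaded = json.loads(out_path.read_text(encoding="utf-8"))
print(validate_dialogues(loaded))
```

A file produced this way could then be passed through the dataset path arguments described in the training section.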
Here is how to train a model:
python models/dialog.py --model_name_or_path t5-base \
--do_train \
--output_dir /path/to/output_dir \
--fp16 \
--train_batch_size 16 \
--num_train_epochs 10 \
--warmup_ratio 0.04 \
--max_seq_length 512
To run on multiple GPUs, set `CUDA_VISIBLE_DEVICES`. By default, training stops early and the best model is saved at `/path/to/output_dir/best_model`.
Other arguments for training are as follows:
- `--learning_rate`: Initial learning rate for Adam.
- `--gradient_accumulation_steps`: Number of steps to accumulate gradient before performing a backward/update pass.
- `--enable_infonce`: Whether to use the InfoNCE model. Note that `negative_samples` must be present in the input data for contrastive learning. Also, `--fp16` should not be set.
- `--max_negative_samples`: The number of negative samples per training example (works only when InfoNCE is enabled).
- `--inbatch_negatives`: Whether to use in-batch negative sampling (works only when InfoNCE is enabled).
- `--loss_truncation`: Whether to use loss truncation.
- `--ctrl`: Whether to use controlled generation. Note that `control_tokens` must be present in the input data. To learn about how to compute control tokens, see here.
- `--train_dataset_path` (optional): Path to your own training dataset.
- `--eval_dataset_path` (optional): Path to your own validation dataset.
For a complete list of arguments, take a look at models/dialog.py and models/lightning_base.py.
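For readers unfamiliar with the InfoNCE objective mentioned above, here is a minimal, dependency-free sketch of the generic loss for one training example. It is not the implementation in models/dialog.py: the scores, the `temperature` parameter, and the function name `info_nce_loss` are hypothetical, and how the repository scores a gold response against its negative samples is not shown in this README.

```python
import math


def info_nce_loss(positive_score, negative_scores, temperature=1.0):
    """Generic InfoNCE loss for one example.

    Computes -log( exp(s_pos / t) / (exp(s_pos / t) + sum_j exp(s_neg_j / t)) )
    using the log-sum-exp trick for numerical stability.
    """
    logits = [positive_score / temperature]
    logits += [score / temperature for score in negative_scores]
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]


# Hypothetical scores: the loss shrinks as the gold response outscores the negatives.
print(info_nce_loss(1.0, [0.0, 0.0]))
print(info_nce_loss(5.0, [0.0, 0.0]))
```

The loss approaches zero as the positive score dominates, which is what pushes a model to prefer the faithful response over the negative samples.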
To compute perplexity of a model on the validation data, simply run:
python models/dialog.py --model_name_or_path /path/to/model/best_model \
--do_eval \
--eval_batch_size 16
For the test data, `--do_eval` should be replaced with `--do_test`.
Note that evaluation should be run on a single GPU.
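As a reminder of what the reported number means, perplexity is conventionally the exponential of the average per-token negative log-likelihood. The sketch below shows that relationship only; how models/dialog.py aggregates the loss over tokens and batches is not described here, so treat the function `perplexity` and its inputs as illustrative assumptions.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token to compute perplexity")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical per-token losses for a short response.
print(perplexity([2.1, 1.7, 2.4, 1.9]))
```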
To compute the other metrics reported in the paper (BLEU, ROUGE, F1, BERTScore, and Q^2), we used the scripts provided in https://github.com/orhonovich/q-squared.
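To give a feel for the simplest of these metrics, the following is a sketch of a unigram-overlap F1 between two strings. It assumes lowercasing and whitespace tokenization, which may differ from the normalization in the q-squared scripts, so numbers from this function are not comparable to those in the paper.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Harmonic mean of unigram precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("the concert started in Montreal", "the concert ended in Sydney"))
```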
To generate a response, simply run:
python models/generate.py --model_name_or_path /path/to/model/best_model --do_sample --top_p 0.6
Arguments for generation are as follows:
- `--output` (optional): Path of the output directory to save the generated responses.
- `--dataset_path` (optional): Path to your own dataset.
- `--control_tokens` (optional): Control tokens, prepended to the sequence, for controlled generation.
- `--max_length` (default: 100): Maximum length of the generated sequence.
For a complete list of arguments, refer to models/generate.py.
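The generation command above uses nucleus (top-p) sampling. The sketch below illustrates the idea on a toy next-token distribution: keep the smallest set of most likely tokens whose cumulative probability reaches `top_p`, renormalize, and sample from that set. The function names and the toy vocabulary are made up for this example; the actual decoding is handled by the underlying generation library, which operates on logits and differs in detail.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to one.
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


def sample_top_p(probs, top_p, rng=random):
    """Draw one token from the top-p filtered distribution."""
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution (hypothetical values).
next_token_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(top_p_filter(next_token_probs, 0.6))
print(sample_top_p(next_token_probs, 0.6, random.Random(0)))
```

A lower `top_p` restricts sampling to fewer high-probability tokens, trading diversity for more conservative output.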
We also use our collected data to frame the problem of identifying hallucination as a binary classification task where the goal is to predict whether an utterance is faithful or not, given the source knowledge. We call this model FaithCritic.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("McGill-NLP/roberta-large-faithcritic")
model = AutoModelForSequenceClassification.from_pretrained("McGill-NLP/roberta-large-faithcritic")
knowledge = "A cardigan is a type of knitted garment (sweater) that has an open front."
response = "The old version is the regular one, knitted garment that has open front and buttons!"
# Encode the (knowledge, response) pair as PyTorch tensors; return_tensors belongs on this call, not on from_pretrained.
inputs = tokenizer(knowledge, response, return_tensors="pt")
# Inference only, so gradients are not needed.
with torch.no_grad():
    logits = model(**inputs).logits
# Index of the predicted class.
print(torch.argmax(logits, dim=-1).item())
python models/critic.py --model_name_or_path roberta-large --do_train --train_batch_size 16 \
--learning_rate 1e-5 --weight_decay 0.1 --warmup_ratio 0.08 --pad_to_multiple_of 8 --fp16 \
--output_dir /path/to/output
python models/critic.py --model_name_or_path /path/to/model --eval_batch_size 16 --do_test
To test on other datasets, you need to pass `--test_task {BEGIN|MNLI}`.
For BEGIN and MNLI, `--test_dataset_path` is required, and the data can be downloaded from here and here, respectively.
For MNLI, it is also possible to use the version hosted on 🤗 Datasets by not passing `--test_dataset_path`, but the results would be slightly different.
If you have any questions (:question:) related to the code, or encounter any problems (:hammer_and_wrench:), or want to report a bug (:bug:), feel free to open an issue.
If you want to cite our papers, please use:
@article{dziri2022faithdial,
title = "{FaithDial: A Faithful Benchmark for Information-Seeking Dialogue}",
author = {Dziri, Nouha and Kamalloo, Ehsan and Milton, Sivan and Zaiane, Osmar and Yu, Mo and Ponti, Edoardo M and Reddy, Siva},
journal = {Transactions of the Association for Computational Linguistics},
volume = {10},
pages = {1473--1490},
year = {2022},
month = {12},
publisher = {MIT Press},
doi={10.1162/tacl_a_00529}
}
and
@inproceedings{dziri2022origin,
title = "On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?",
author = {Dziri, Nouha and Milton, Sivan and Yu, Mo and Zaiane, Osmar and Reddy, Siva},
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
year = {2022},
pages = "5271--5285",
address = "Seattle, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.naacl-main.387"
}
Bibkey in aclanthology: `dziri-etal-2022-origin`.
This work is licensed under the MIT license. See LICENSE for details.