Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family
If you find our code, data and paper useful, please kindly cite:
@inproceedings{TanMLLHCQ23,
title={Can ChatGPT Replace Traditional {KBQA} Models? An In-Depth Analysis of the Question Answering Performance of the {GPT} {LLM} Family},
author={Tan, Yiming and Min, Dehai and Li, Yu and Li, Wenbo and Hu, Nan and Chen, Yongrui and Qi, Guilin},
booktitle={The Semantic Web - {ISWC} 2023 - 22nd International Semantic Web Conference, Athens, Greece, November 6-10, 2023, Proceedings, Part {I}},
volume={14265},
pages={348--367},
publisher={Springer},
year={2023}
}
A framework for the detailed evaluation of the ability of the GPT family and popular open-source large language models to answer complex questions using their inherent knowledge.
We have released the answers of ChatGPT and other models to a total of 194,782 questions across 8 datasets (6 English datasets and 2 multilingual datasets, covering multiple languages) in Datasets we publish.
This repository is mainly contributed by Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Guilin Qi.
To evaluate the ability of large language models such as ChatGPT to perform KB-based complex question answering (KB-based CQA), we proposed an evaluation framework:
First, we designed multiple labels to describe the answer type, reasoning operations required to answer the question, and language type of each test question.
Second, based on the black-box testing specifications proposed by Microsoft's CheckList, we designed an evaluation method that introduces CoT prompts to measure the reasoning capability and reliability of large language models when answering complex questions.
Our evaluation used eight real-world complex QA datasets, including six English datasets and two multilingual datasets, to further analyze the potential impact of language type on the performance of large language models.
We compared the evaluation results of FLAN-T5, ChatGPT, GPT-3, the GPT-3.5 series, and GPT-4 to identify the gains brought by successive iterations within the GPT family, as well as commonalities between GPT-family models and other LLMs.
The following table shows the performance of the evaluated models on the different datasets; we also compare them with the current SOTA traditional KBQA models (fine-tuned (FT) and zero-shot (ZS)).
(When evaluating an answer, we only consider two outcomes: correct or incorrect. Each question's F1 is therefore either 1 or 0, so the averaged F1 equals accuracy and our Acc score is the same as our F1 score.)
We organize the answers of these models to the KBQA datasets by dataset and by model, and release them in this folder.
answers_from_LLMs: The responses (answers) of these models (ChatGPT, GPT-3/GPT-3.5, FLAN-T5, GPT-4) to the KBQA datasets mentioned in Datasets we use.
Datasets | Size | Collected Size | Lang |
---|---|---|---|
KQAPro | 117970 | 106173 | EN |
LC-quad2.0 | 26975 | 26975 | EN |
WQSP | 4737 | 4700 | EN |
CWQ | 31158 | 31158 | EN |
GrailQA | 64331 | 6763 | EN |
GraphQuestions | 4776 | 4776 | EN |
QALD-9 | 6045 | 6045 | Mul |
MKQA | 260000 | 6144 | Mul |
Total Collected | - | 194782 | - |
datasets: We have processed the 8 datasets mentioned in Datasets we use into a unified format and released them in this folder. Each dataset in the unified format includes the following items: question_id, question, ground_truth, SPARQL, and our added labels. Additionally, we have generated alias dictionaries from Wikipedia for the ground truth, which are used during evaluation.
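For illustration, a record in the unified format might look like the sketch below. Only question_id, question, ground_truth, SPARQL, and the added labels are named above; the exact keys (e.g. the label and alias fields) and all values here are hypothetical.

```python
# Hypothetical example of a record in the unified format.
# The exact field names and values below are illustrative, not taken from the released files.
example_record = {
    "question_id": "WQSP_0001",
    "question": "Who wrote the novel Dracula?",
    "ground_truth": ["Bram Stoker"],
    "SPARQL": 'SELECT ?x WHERE { ?book rdfs:label "Dracula"@en . ?book dbo:author ?x . }',
    "answer_type": "PER",              # one of the 9 answer-type labels
    "reasoning_type": ["Single-hop"],  # one or more of the 8 reasoning-type labels
    "lang": "en",                      # language-type label
    "alias": {"Bram Stoker": ["Abraham Stoker"]},  # alias dictionary built from Wikipedia
}
```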
To highlight the complexity of the test questions and the breadth of the test data, we carefully selected six representative English monolingual KBQA datasets and two multilingual KBQA datasets for evaluation.
💥 Please note: The links in the Source section below refer to the original datasets as published by their respective authors. For our experiments in this paper, we have processed these datasets accordingly, including random sampling and formatting. Please download the datasets used in our experiments from this folder: datasets.
Monolingual datasets | Source | Paper |
---|---|---|
WebQuestionsSP (WQSP) | Download_url | Paper_url |
ComplexWebQuestions (CWQ) | Download_url | Paper_url |
GraphQuestions | Download_url | Paper_url |
GrailQA | Download_url | Paper_url |
KQAPro | Download_url | Paper_url |
LC-quad2.0 | Download_url | Paper_url |
Multilingual datasets
Multilingual datasets | Source | Paper |
---|---|---|
QALD-9 | Download_url | Paper_url |
MKQA | Download_url | Paper_url |
We have uploaded our code for using ChatGPT to collect answers to the questions in datasets. The code uses the official OpenAI API. If you want to learn more about the API, see OpenAI's official documentation: https://platform.openai.com/docs/guides/chat.
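As a minimal sketch (not the repository's exact script), answers can be collected with the openai Python package roughly as follows; the model name, system prompt, and parameters are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_chatgpt(question: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one KBQA question to the chat completions endpoint and return the answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer the question as concisely as possible."},
            {"role": "user", "content": question},
        ],
        temperature=0,  # keep answers as deterministic as possible for evaluation
    )
    return response.choices[0].message.content

print(ask_chatgpt("Who wrote the novel Dracula?"))
```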
We have released the code for evaluating the EM score (model performance) of the models' answers in our paper, located in evaluation_code. We believe it is a good reference for evaluating the correctness of generative language models on question-answering tasks.
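The snippet below is a simplified sketch of this kind of evaluation, not the released evaluation_code: it normalizes strings, expands each gold answer with its aliases, and counts a prediction as correct if it contains any of them. The normalization and containment-matching choices are assumptions.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_correct(prediction: str, gold_answers: list[str],
               aliases: dict[str, list[str]] | None = None) -> bool:
    """A prediction is judged correct if it contains any gold answer or one of its aliases."""
    pred = normalize(prediction)
    candidates = list(gold_answers)
    for gold in gold_answers:
        candidates.extend((aliases or {}).get(gold, []))
    return any(normalize(c) in pred for c in candidates)

def accuracy(predictions: list[str], golds: list[list[str]]) -> float:
    """Binary per-question judgment, so this Acc equals the averaged F1."""
    return sum(is_correct(p, g) for p, g in zip(predictions, golds)) / len(predictions)
```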
The data for Invariance test (INV) and Directional Expectation test (DIR) are published at: INV_and_DIR
We assess the LLMs' ability to handle each feature in the KB-based CQA scenario through the Minimal Functional Test (MFT); we classify the answer types into 9 categories: Mixed fact (MISC); Reason (WHY); Location (LOC); Time (DATE/TIME); Character (PER); Yes or no (Boolean); Number (NUM); Organization (ORG); Unable to answer (UNA).
At the same time, we divide the "reasoning type" labels into eight categories: SetOperation; Filtering; Counting; The most valuable; Sort; Single-hop; Multi-hop; Star-shape.
We also take into account the "language type" label, which may have an impact on model performance: de; ru; pt; hi_IN; en; fa; it; fr; ro; es; nl; pt_BR; zh_cn.
We designed two methods to generate test cases for INV:
- Randomly introducing spelling errors into the original question.
- Generating a question that is semantically equivalent (paraphrased) to the original question.
Then, we evaluate the invariance of the LLMs by checking whether their correctness remains consistent across the outputs for the three inputs (the original question plus the two variants generated by the methods above).
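A minimal sketch of the first perturbation (random spelling errors) is shown below; the swap-adjacent-characters strategy and error count are assumptions, not necessarily what the released INV data uses.

```python
import random

def perturb_spelling(question: str, n_errors: int = 1, seed: int | None = None) -> str:
    """Create an INV test case by swapping adjacent characters inside words."""
    rng = random.Random(seed)
    chars = list(question)
    for _ in range(n_errors):
        positions = [i for i in range(len(chars) - 1)
                     if chars[i] != " " and chars[i + 1] != " "]
        if not positions:
            break
        i = rng.choice(positions)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(perturb_spelling("Who wrote the novel Dracula?", n_errors=2, seed=0))
```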
We created three modes to generate DIR test cases:
- Replacing reasoning-operation-related phrases in questions to observe the LLMs' output.
- Adding answer-type prompts to test the LLMs' control over their output.
- Using multi-round questioning inspired by CoT to observe the LLMs' sensitivity to CoT prompts and their effectiveness for different question types.
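As an illustration of the second mode, a DIR case can be built by appending an answer-type hint to the question and then checking whether the model's output format follows that direction; the hint wording below is hypothetical, not the paper's exact template.

```python
# Hypothetical answer-type hints for the DIR "answer-type prompt" mode.
ANSWER_TYPE_HINTS = {
    "NUM": "Please answer with a number only.",
    "Boolean": "Please answer with 'yes' or 'no' only.",
    "DATE/TIME": "Please answer with a date or time only.",
}

def build_dir_case(question: str, answer_type: str) -> str:
    """Append an answer-type hint so we can test the LLM's control over its output format."""
    hint = ANSWER_TYPE_HINTS.get(answer_type, "")
    return f"{question} {hint}".strip()

print(build_dir_case("How many moons does Mars have?", "NUM"))
```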