Complex-question-answering-evaluation-of-GPT-family

Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family

Citation

If you find our code, data and paper useful, please kindly cite:

@inproceedings{TanMLLHCQ23,
  title     = {Can ChatGPT Replace Traditional {KBQA} Models? An In-Depth Analysis of the Question Answering Performance of the {GPT} {LLM} Family},
  author    = {Tan, Yiming and Min, Dehai and Li, Yu and Li, Wenbo and Hu, Nan and Chen, Yongrui and Qi, Guilin},
  booktitle = {The Semantic Web - {ISWC} 2023 - 22nd International Semantic Web Conference, Athens, Greece, November 6-10, 2023, Proceedings, Part {I}},
  volume    = {14265},
  pages     = {348--367},
  publisher = {Springer},
  year      = {2023}
}

A framework for the detailed evaluation of the ability of the GPT family and popular open-source large language models to answer complex questions using their inherent knowledge.

We have released the answers of ChatGPT and other models to a total of 194,782 questions across 8 datasets (6 English datasets, 2 multilingual datasets), covering multiple languages; see Datasets we publish.

The main contributors to this repository are Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, and Guilin Qi.


Overview

(Figure: overview of the evaluation framework)

To evaluate the ability of large language models such as ChatGPT to answer KB-based complex questions (KB-based CQA), we propose an evaluation framework:

First, we designed multiple labels to describe the answer type, reasoning operations required to answer the question, and language type of each test question.

Second, based on the black-box testing specifications proposed by Microsoft's CheckList, we designed an evaluation method that introduces CoT prompts to measure the reasoning capability and reliability of large language models when answering complex questions.

Our evaluation used eight real and complex QA datasets, including six English datasets and two multilingual datasets, to further analyze the potential impact of language type on the performance of large language models.

We compared the evaluation results of FLAN-T5, ChatGPT, GPT-3, the GPT-3.5 series, and GPT-4 to determine the iterative benefits within the GPT family and some commonalities between GPT family models and other LLMs.

Overall results

The following table shows the performance of the evaluated models on the different datasets, compared with the current SOTA traditional KBQA models, both fine-tuned (FT) and zero-shot (ZS).

(When evaluating answers, we only consider two situations: answering correctly or answering incorrectly. Therefore, our Acc score is the same as our F1 score.)

(Figure: overall results)

Datasets we publish

(Figure: label statistics of the released datasets)

We group these models' answers to the KBQA datasets by dataset and by model, and release them in this folder.

answers_from_LLMs: the responses (answers) of these models (ChatGPT, GPT-3/GPT-3.5, FLAN-T5, GPT-4) to the KBQA datasets mentioned in Datasets we use.

| Datasets | Size | Collected Size | Lang |
| --- | --- | --- | --- |
| KQAPro | 117970 | 106173 | EN |
| LC-quad2.0 | 26975 | 26975 | EN |
| WQSP | 4737 | 4700 | EN |
| CWQ | 31158 | 31158 | EN |
| GrailQA | 64331 | 6763 | EN |
| GraphQuestions | 4776 | 4776 | EN |
| QALD-9 | 6045 | 6045 | Mul |
| MKQA | 260000 | 6144 | Mul |
| Total Collected | | 194782 | |

datasets: We have processed the 8 datasets mentioned in Datasets we use into a unified format and released them in this folder. Each entry in the unified format includes: question_id, question, ground_truth, SPARQL, and our added labels. Additionally, we have generated alias dictionaries from Wikipedia for the ground truth, which can be used during evaluation.
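For illustration, a single record in this unified format might look like the sketch below; the field values and exact label names are hypothetical placeholders, so refer to the released files for the real schema.

```python
# Hypothetical example of one record in the unified format described above.
# Field values are illustrative only; see the files in the datasets folder
# for the actual label names and answer formats.
example_record = {
    "question_id": "WQSP_0001",                        # hypothetical ID
    "question": "Who wrote the novel One Hundred Years of Solitude?",
    "ground_truth": ["Gabriel Garcia Marquez"],
    "SPARQL": "SELECT ?x WHERE { ... }",               # original dataset query, elided here
    "answer_type": "PER",                              # one of the 9 answer-type labels
    "reasoning_type": "Single-hop",                    # one of the 8 reasoning-type labels
    "language": "en",
}
```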

Datasets we use

To highlight the complexity of the testing questions and the breadth of the testing dataset, after careful consideration, we selected six representative English monolingual KBQA datasets and two multilingual KBQA datasets for evaluation.

💥 Please note: The links in the Source column below point to the original datasets as published by their respective authors. For the experiments in our paper, we processed these datasets (including random sampling and reformatting). Please download the datasets used in our experiments from this folder: datasets.

| Monolingual datasets | Source | Paper |
| --- | --- | --- |
| WebQuestionSP (WQSP) | Download_url | Paper_url |
| ComplexWebQuestion (CWQ) | Download_url | Paper_url |
| GraphQuestions | Download_url | Paper_url |
| GrailQA | Download_url | Paper_url |
| KQApro | Download_url | Paper_url |
| LC-quad2.0 | Download_url | Paper_url |

Multilingual datasets

| Multilingual datasets | Source | Paper |
| --- | --- | --- |
| QALD-9 | Download_url | Paper_url |
| MKQA | Download_url | Paper_url |

Code for ChatGPT API

We have uploaded the code we used to collect ChatGPT's answers to the questions in the datasets. The code uses the official OpenAI API. To learn more about the API, see OpenAI's documentation: https://platform.openai.com/docs/guides/chat.
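For reference, a minimal sketch of querying the chat API for a single question is shown below. It is not the exact script in this repository; the model name, prompt wording, and parameters are placeholder assumptions.

```python
# Minimal sketch of collecting an answer for one question via the OpenAI chat API.
# Model, prompt, and parameters are placeholders, not necessarily those used in our scripts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_llm(question: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer the question as briefly as possible."},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(ask_llm("Who wrote One Hundred Years of Solitude?"))
```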

Code for evaluating model performance

We have released the code used in our paper for evaluating the EM score (model performance) of the models' answers; it is located in evaluation_code. We believe it is a good reference for evaluating the correctness of generative language models on question-answering tasks.
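The core idea behind this kind of exact-match scoring can be sketched as follows. This is a simplified illustration rather than the released evaluation_code; the normalization rules and the way the Wikipedia alias dictionaries feed into `gold_aliases` are assumptions made for the example.

```python
# Simplified sketch of exact-match (EM) scoring with alias matching.
# The released evaluation_code handles normalization and aliases in more detail.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_correct(model_answer: str, gold_aliases: list[str]) -> bool:
    """Count a prediction as correct if any gold answer (or alias) appears in it."""
    pred = normalize(model_answer)
    return any(normalize(gold) in pred for gold in gold_aliases)

# Each item pairs a model answer with the gold answer plus its aliases.
answers = [
    ("Gabriel Garcia Marquez wrote it.",
     ["Gabriel García Márquez", "Gabriel Garcia Marquez"]),
]
em = sum(is_correct(pred, gold) for pred, gold in answers) / len(answers)
print(f"EM: {em:.3f}")
```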

CheckList Model

The data for the Invariance test (INV) and the Directional Expectation test (DIR) are published at INV_and_DIR.

(Figure: illustration of the INV and DIR tests)

Minimum Functionality Test (MFT)

We assess the LLMs' ability to handle each feature of the KB-based CQA scenario through the Minimum Functionality Test (MFT). We classify the answer types into nine categories: Mixed fact (MISC), Reason (WHY), Location (LOC), Time (DATE/TIME), Character (PER), Yes or no (Boolean), Number (NUM), Organization (ORG), and Unable to answer (UNA).

At the same time, we divide the "reasoning type" labels into eight categories: SetOperation, Filtering, Counting, The most valuable, Sort, Single-hop, Multi-hop, and Star-shape.

We also take into account the "language type" label, which may have an impact on model performance: de, ru, pt, hi_IN, en, fa, it, fr, ro, es, nl, pt_BR, zh_cn.

Invariance test (INV)

We designed two methods to generate test cases for INV:

  1. Randomly introducing spelling errors into the original question.
  2. Generating a question that is semantically equivalent to (a paraphrase of) the original question.

Then, we evaluate the invariance of the LLMs by checking whether the correctness of their outputs is consistent across the three inputs (the two generated variants and the original question).
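A minimal sketch of the first perturbation method (random spelling errors) is given below; the type and number of character edits are illustrative assumptions, not necessarily those used to build INV_and_DIR.

```python
# Sketch of INV test-case generation, method 1: random spelling errors.
# The kind and number of edits here are illustrative assumptions.
import random

def add_spelling_errors(question: str, n_errors: int = 2, seed: int = 0) -> str:
    """Swap two adjacent characters inside randomly chosen words."""
    rng = random.Random(seed)
    words = question.split()
    for _ in range(n_errors):
        i = rng.randrange(len(words))
        w = words[i]
        if len(w) > 3:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

print(add_spelling_errors("Who directed the film Inception?"))
```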

Directional Expectation test (DIR)

We created three modes to generate DIR test cases:

  1. Replacing reasoning operation-related phrases in questions to observe the LLMs' output.
  2. Adding prompts for answer types to test the LLMs' control over output.
  3. Using multi-round questioning inspired by CoT to observe the LLMs' sensitivity and effectiveness with CoT prompts for different question types (a sketch of these modes follows this list).
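For illustration, the three modes could look roughly like the following; the question and prompt wording are hypothetical examples, not the exact prompts used in our tests.

```python
# Hypothetical examples of the three DIR modes; wording is illustrative only.
original = "How many rivers flow through Egypt?"

# Mode 1: replace the reasoning-operation phrase (counting -> set listing).
mode1 = "Which rivers flow through Egypt?"

# Mode 2: add a prompt constraining the answer type.
mode2 = original + " Please answer with a number only."

# Mode 3: multi-round questioning with a CoT prompt; the model's reasoning from
# turn 1 is fed back before asking for the final answer in turn 2.
mode3_turns = [
    original + " Let's think step by step.",
    "Based on the reasoning above, give only the final answer.",
]
```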
