WikiQA-Salience is a dataset for evaluating entity salience prediction on extremely short question-answer pair passages.
We leveraged the WikiQA dataset as a starting point to create a new entity salience dataset from publicly available data. The WikiQA corpus is an answer sentence selection (AS2) dataset where the questions are derived from query logs of the Bing search engine, and the answer candidates are extracted from Wikipedia. The examples are Q/A pairs in natural language with full-sentence (non-factoid) answers, which resemble the type of responses provided by conversational assistants.
Our dataset augments a subset of the Q/A pairs in WikiQA with named entities extracted from the question and answer text and linked to Wikidata, together with ground truth labels for the salience of each entity to the Q/A pair.
We first selected the Q/A pairs in the WikiQA corpus where the answer is labeled as correctly answering the question (positive pairs). We then applied the ReFinED entity linking model to the combined question-answer text to extract named entities, and augmented each entity with its name, description, and aliases from Wikidata. Since Wikidata descriptions are typically extremely brief, we further augment the entities in the dataset with more detailed information from Wikipedia pages (wherever these are available), including the Wikipedia summary (i.e., the first section of the page) and the first 100 noun-phrases from the article. Ground truth labels were generated by crowd workers on the Amazon Mechanical Turk platform, who rated the relevance of each entity to the Q/A pair it was extracted from on a three-level scale (“High”, “Moderate”, “Low”).
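For illustration, here is a minimal sketch of the extraction step, assuming ReFinED's published inference API (the model and entity-set names follow the ReFinED README and may differ between versions); the Q/A text is only an illustrative example, not the exact pipeline or configuration used to build the dataset:

```python
# Hedged sketch of entity extraction with ReFinED
# (https://github.com/amazon-science/ReFinED); not the dataset's exact pipeline.
from refined.inference.processor import Refined

refined = Refined.from_pretrained(model_name="wikipedia_model",
                                  entity_set="wikipedia")

# Entities are extracted from the combined question-answer text
question = "how are glacier caves formed?"
answer = "A glacier cave is a cave formed within the ice of a glacier."
spans = refined.process_text(question + " " + answer)

for span in spans:
    # Each span carries the mention text, the linked (Wikidata) entity,
    # and predicted type information
    print(span)
```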
The finished dataset consists of 687 annotated Q/A pairs with the linked entity data from ReFinED, entity details from Wikidata, and the (5-pass) crowd worker salience ratings. The 687 Q/A pairs contain 2113 entities (unique at the Q/A pair level), and the mean length of the question-answer text is just 190.6 characters (32.9 words). The distribution of the entity salience labels is significantly skewed towards salient entities, with 1089 rated “High”, 535 rated “Moderate”, and 489 rated “Low”. Additional details on the construction of the dataset can be found in our paper (cited below).
| | mean | std | min | max |
|---|---|---|---|---|
| characters | 190.6 | 68.4 | 49 | 589 |
| words | 32.9 | 11.2 | 8 | 89 |

Table 1: Q/A pair context size
| number of entities | count |
|---|---|
| 2 | 281 |
| 3 | 188 |
| 4 | 125 |
| 5 | 71 |
| 6 | 22 |

Table 2: Distribution of number of entities
| median rating | count |
|---|---|
| High | 1089 |
| Moderate | 535 |
| Low | 489 |
| Total | 2113 |

Table 3: Distribution of ground truth labels
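The counts in Tables 2 and 3 can be recomputed directly from the released file; the sketch below is one way to do it with pandas (field names are described below; note that `gt-rating-median` is the normalized numeric value rather than the label string, so Table 3 is reproduced in numeric form):

```python
import pandas as pd

df = pd.read_json("wikiqa_salience.jsonl", lines=True)

# Table 2: number of linked entities per Q/A pair
print(df["entities"].str.len().value_counts().sort_index())

# Table 3 (numeric form): distribution of per-entity normalized median ratings
medians = df.explode("entities")["entities"].apply(lambda e: e["gt-rating-median"])
print(medians.value_counts().sort_index())
```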
The dataset (`wikiqa_salience.jsonl`), in the JSON Lines format, contains the following fields:
- QuestionID: QuestionID from the original WikiQA dataset
- SentenceID: SentenceID from the original WikiQA dataset
- entities: a list of entity objects
Each entity object contains the following fields:
- text: the mention text
- category: the coarse mention type (from ReFinED)
- predicted-entity-types: the predicted entity types
- wikidata-entity-id: the Wikidata entity ID
- el-score: the ReFinED entity linking model confidence score
- start-char: the start character of the mention text within the passage
- end-char: the end character of the mention text within the passage
- backend: the name of the entity linking model (i.e. "refined")
- wikidata-entity-name: the canonical name of the entity in Wikidata
- wikidata-entity-description: a short textual description of the entity from Wikidata
- wikidata-entity-aliases: a list of aliases for the entity from Wikidata
- gt-rating-mean: the mean normalized numeric rating in the range [0, 1]
- gt-rating-std: the standard deviation of the normalized numeric ratings
- gt-rating-median: the median normalized numeric rating in the range [0, 1]
- gt-ratings-raw: a list of strings containing the ratings from each pass of annotation, from the set {"High", "Moderate", "Low"}
- sum-first-section: the Wikipedia page summary (the first section of the page)
- sum-noun-phrase-spacy: the first 100 noun-phrases from the article, extracted with spaCy
- sum-keywords-spacy: the first 100 key phrases from the article, extracted with spaCy
- sum-keywords-rake: the first 100 key phrases from the article, extracted with RAKE
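A minimal sketch of loading the file and flattening the nested entities list into one row per entity (column names follow the field list above; the path is wherever you saved the dataset):

```python
import pandas as pd

# One row per Q/A pair, with a nested list of entity objects
pairs_df = pd.read_json("wikiqa_salience.jsonl", lines=True)

# Flatten to one row per entity, keeping the Q/A pair identifiers
exploded = pairs_df.explode("entities").reset_index(drop=True)
entities_df = pd.concat(
    [exploded[["QuestionID", "SentenceID"]],
     pd.json_normalize(exploded["entities"].tolist())],
    axis=1,
)

print(entities_df[["QuestionID", "text", "wikidata-entity-id", "gt-rating-median"]].head())
```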
The WikiQA-Salience dataset does not contain the question and answer text from the original WikiQA dataset, only the extracted entities, enrichments from Wikidata/Wikipedia, and the ground truth labels. The datasets can be easily merged as shown below, where `WIKIQA_PATH` points to the `WikiQA.tsv` file from the WikiQA corpus and `ENTITY_SALIENCE_DATA_PATH` points to the `wikiqa_salience.jsonl` file from this dataset.
```python
import pandas as pd

wikiqa_df = pd.read_csv(WIKIQA_PATH, sep="\t")
entities_df = pd.read_json(ENTITY_SALIENCE_DATA_PATH, lines=True)
joined_df = entities_df.join(
    wikiqa_df.set_index(["QuestionID", "SentenceID"]),
    on=["QuestionID", "SentenceID"],
    how="inner",
)
```
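The start-char/end-char offsets refer to the combined question-answer passage the entities were extracted from. Below is a hedged sketch of reconstructing that passage from the merged frame, assuming the question and answer were joined with a single space (the exact concatenation used during extraction may differ, so verify offsets against the mention text):

```python
# Rebuild the passage from the WikiQA "Question" and "Sentence" columns
# (assumption: a single-space join between question and answer)
joined_df["passage"] = joined_df["Question"] + " " + joined_df["Sentence"]

row = joined_df.iloc[0]
for ent in row["entities"]:
    mention = row["passage"][ent["start-char"]:ent["end-char"]]
    print(ent["text"], "->", mention)  # should match when the join assumption holds
```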
See CONTRIBUTING for more information.
Please cite our paper if you use this dataset for your own research:
```bibtex
@inproceedings{bullough-etal-2024-predicting,
    title = "Predicting Entity Salience in Extremely Short Documents",
    author = "Bullough, Benjamin and
      Lundberg, Harrison and
      Hu, Chen and
      Xiao, Weihang",
    editor = "Dernoncourt, Franck and
      Preo{\c{t}}iuc-Pietro, Daniel and
      Shimorina, Anastasia",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    month = nov,
    year = "2024",
    address = "Miami, Florida, US",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-industry.5",
    pages = "50--64",
    abstract = "A frequent challenge in applications that use entities extracted from text documents is selecting the most salient entities when only a small number can be used by the application (e.g., displayed to a user). Solving this challenge is particularly difficult in the setting of extremely short documents, such as the response from a digital assistant, where traditional signals of salience such as position and frequency are less likely to be useful. In this paper, we propose a lightweight and data-efficient approach for entity salience detection on short text documents. Our experiments show that our approach achieves competitive performance with respect to complex state-of-the-art models, such as GPT-4, at a significant advantage in latency and cost. In limited data settings, we show that a semi-supervised fine-tuning process can improve performance further. Furthermore, we introduce a novel human-labeled dataset for evaluating entity salience on short question-answer pair documents.",
}
```
This project is licensed under the CC-BY-SA 4.0 License.