This repository contains the GASP dataset.
GASP is a dataset composed of lists of cited abstracts, each associated with the corresponding source abstract.
The goal is to generate a paper's abstract given the abstracts of the papers it cites, and to model the human creativity behind the process.
Creativity is one of the driving forces of humankind, as it allows us to break with current understanding and envision new ideas, which may revolutionize entire fields of knowledge. Scientific research offers a challenging environment in which to learn creativity. In fact, scientific research is a creative act within the formal setting of the scientific method, and this creative act is described in articles.
Here, we dare to introduce the novel as well as scientifically and philosophically challenging task of Generating Abstracts of Scientific Papers from the abstracts of their citations (GASP) as a text-to-text rewriting task to investigate scientific creativity. To foster research on this original challenge, we set up an annotated dataset by using services that have solved the copyright problems. As a result, the dataset is publicly available and offers examples of real papers and a reference training-test split. Finally, three vanilla summarization systems have already been applied to the dataset, and their outcomes provide an early measure of the complexity of the GASP task.
The GASP task aims at producing the target abstract of a paper, "output_paper", given the abstracts of the set of papers it refers to, "input_papers". The latter is a list of abstracts, which we may assume were inspirational for the idea in the target paper.
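As an illustration of the text-to-text framing, the sketch below turns one record into a (source, target) pair by joining the cited abstracts into a single input string. The separator token and the function name `to_text_pair` are assumptions made for this example; GASP itself does not prescribe how the input abstracts should be combined.

```python
from typing import Dict, Tuple

# Hypothetical separator between cited abstracts; not specified by GASP.
SEPARATOR = " [SEP] "


def to_text_pair(record: Dict, separator: str = SEPARATOR) -> Tuple[str, str]:
    """Build a (source, target) pair from one GASP record.

    The source is the concatenation of the cited abstracts
    ("input_papers"); the target is the abstract to generate
    ("output_paper").
    """
    source = separator.join(
        abstract.strip() for abstract in record["input_papers"]
    )
    target = record["output_paper"].strip()
    return source, target
```

A sequence-to-sequence model would then be trained to map `source` to `target`. Long lists of cited abstracts may exceed a model's input limit, so some truncation or selection strategy is likely needed in practice.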
The dataset is composed of a training set of 100000 elements, plus a test set and a validation set of 10000 elements each.
Each set is a .json file composed of a list of elements with the structure below:
[{
"input_papers": [
"CONTEXT Mentoring, as a partnership ... issues, and using cross-disciplinary approaches.",
"PROBLEM AND BACKGROUND In 1998, the University of ... retention in academic medicine.",
"PURPOSE To determine (1) the prevalence of ... deliberate approach to the practice of mentoring.",
...],
"output_paper": "The purpose of this article is to ... proposed next steps for research in this area.",
"output_id": "363801205efb14a28dea8cbcdc86afc2eb908f53"
},
...
]
Where:
- "input_papers" is a variable-length list of strings, each the abstract of a cited paper
- "output_paper" is the text of the output paper (string)
- "output_id" is the S2 Paper ID of the output paper, which can be used to look up additional information with the Semantic Scholar API
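The structure above can be read and checked with a few lines of Python. This is a minimal sketch, assuming a split is stored as a single JSON list; the file name `train.json` in the usage note is hypothetical, since the actual file names are not stated here, and the helper names are made up for this example.

```python
import json
from typing import Any, Dict, List

# Field names taken from the record structure described above.
STRING_FIELDS = ("output_paper", "output_id")


def is_valid_record(record: Any) -> bool:
    """Return True if the record has the three documented fields and types."""
    if not isinstance(record, dict):
        return False
    inputs = record.get("input_papers")
    if not isinstance(inputs, list):
        return False
    if not all(isinstance(abstract, str) for abstract in inputs):
        return False
    return all(isinstance(record.get(name), str) for name in STRING_FIELDS)


def load_gasp_split(path: str) -> List[Dict[str, Any]]:
    """Load one split and fail early on the first malformed record."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    for index, record in enumerate(records):
        if not is_valid_record(record):
            raise ValueError(f"Malformed record at index {index}")
    return records
```

For example, `records = load_gasp_split("train.json")` followed by `len(records[0]["input_papers"])` gives the number of cited abstracts for the first paper.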
More information is available in the paper.
The corpus is available to download from the ART Site
If you use the GASP dataset, please cite:
@article{Zanzotto_Bono_Vocca_Santilli_Croce_Gambosi_Basili_2020,
title={GASP! Generating Abstracts of Scientific Papers from Abstracts of Cited Papers},
url={http://rgdoi.net/10.13140/RG.2.2.20755.22562},
DOI={10.13140/RG.2.2.20755.22562},
author={Zanzotto, Fabio Massimo and Bono, Viviana and Vocca, Paola and
Santilli, Andrea and Croce, Danilo and Gambosi, Giorgio and Basili, Roberto},
year={2020} }
The GASP dataset is licensed under ODC-BY.
The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.
property | value
---|---
name | Generating Abstracts of Scientific Papers from Abstracts of Cited Papers
alternateName | GASP
url | https://github.com/ART-Group-it/GASP
sameAs | https://github.com/ART-Group-it/GASP
description | The dataset consists of lists of cited abstracts associated with the corresponding source abstract. The goal is to generate the abstract of a target paper given the abstracts of cited papers.
provider |
citation | http://doi.org/10.13140/RG.2.2.20755.22562
If you have comments or suggestions, or want to communicate results, please contact:
fabio.massimo.zanzotto@uniroma2.it
GASP! is a result of research in this project: HitAI.org Human-in-the-loop Artificial Intelligence
Our corpus is based on the Semantic Scholar Corpus. We would like to thank them for their important contribution. If you are using our corpus, please also cite their work.