GTWiki is a non-parallel dataset for Text-to-Graph (parsing) and Graph-to-Text (generation) tasks. It is used in the framework implemented in our paper: "A multi-task semi-supervised framework for Text2Graph & Graph2Text".
GTWiki can be used for unsupervised learning. The texts and graphs are collected from the same set of 176,000 entities, from Wikipedia and Wikidata respectively.
- English text: 240,024 instances (one or more sentences each), with an average length of 459.67 characters.
- Graphs: 271,095 instances (1 to 6 triples each).
The data are available at `data/monolingual.txt` and `data/graphs.txt`, respectively.
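As a minimal loading sketch, assuming each file stores one instance per line (adjust if the actual format differs, e.g. if instances span multiple lines):

```python
from pathlib import Path


def load_instances(path):
    """Read one instance per line, skipping blank lines.

    The one-instance-per-line layout is an assumption about the
    GTWiki files, not a documented guarantee.
    """
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


if Path("data/monolingual.txt").exists():
    texts = load_instances("data/monolingual.txt")
    graphs = load_instances("data/graphs.txt")
    print(f"{len(texts)} texts, {len(graphs)} graphs")
```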
Alternatively, you can run our collection script and customize it for your needs:
python3 collect.py [WIKIDATA_ID] [WIKIPEDIA_NAME] [MAX_DEPTH]
For example:
python3 collect.py Q762 "Leonardo da Vinci" 1
This command collects both text and graphs from Leonardo da Vinci and from his direct children (depth-1 neighbours) in the graph.
For more information about the collection algorithm, please see our paper.
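To illustrate how the `MAX_DEPTH` argument bounds the traversal, here is a toy breadth-first sketch. The adjacency map below is a hand-made stand-in (the entity IDs and links are illustrative, not real collected data); the actual `collect.py` queries Wikidata as described in the paper:

```python
from collections import deque


def entities_within_depth(graph, seed, max_depth):
    """Return all entities reachable from `seed` in at most `max_depth` hops.

    Breadth-first traversal over an adjacency map; this mirrors the idea
    of the MAX_DEPTH argument, not the real collection code.
    """
    visited = {seed}
    queue = deque([(seed, 0)])
    while queue:
        entity, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(entity, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return visited


# Toy graph: hypothetical links from the seed entity Q762.
toy = {"Q762": ["Q12418"], "Q12418": ["Q19675"]}
print(sorted(entities_within_depth(toy, "Q762", 1)))
```

With depth 1, only the seed and its direct children are collected; raising the depth would also pull in their children, and so on.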
The previous steps require Python >= 3.6. You can install all requirements by executing:
pip3 install -r requirements.txt
If you find our work, data or code useful, please consider citing our paper:
@misc{domingo2022multitask,
    title={A multi-task semi-supervised framework for Text2Graph & Graph2Text},
    author={Oriol Domingo and Marta R. Costa-jussà and Carlos Escolano},
    year={2022},
    eprint={2202.06041},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}