WordNet dataset with semantic relations only
In WordNet two kinds of relations are recognized: lexical and semantic. Lexical relations hold between word forms (lemmas); semantic relations hold between word meanings (synsets).
I wanted a dataset with the lexical relations filtered out, so that synset embeddings can be built from the semantic relations of the WordNet graph alone.
In the dataset folder, you can find many `tsv` and `txt` files, the meaning of which is explained below.
| File name | Purpose | Notes |
|---|---|---|
| `count_synsets.txt` | File that contains the number of synsets. | |
| `count_relations.txt` | File that contains the number of relations. | |
| `count_edges_all.txt` | File that contains the total number of edges. | |
| `count_edges_*.tsv` | Files that contain the number of edges of type `*`. | |
| `synset_name_to_id.tsv` | File that maps each synset's name to a numeric id starting from 0. | The file is sorted on the first column. |
| `synset_id_to_name.tsv` | File that maps each synset id to a synset's name. | The file is sorted on the first column. |
| `relation_name_to_id.tsv` | File that maps each relation's name to a numeric id starting from 0. | The file is sorted on the first column. |
| `relation_id_to_name.tsv` | File that maps each relation id to a relation's name. | The file is sorted on the first column. |
| `edges_as_id_all.tsv` | File that contains all the edges of WordNet's semantic subgraph as triples of ids (id synset 1, id relation, id synset 2). | The file is sorted on the second column. |
| `edges_as_id_*.tsv` | Files that contain only the edges of type `*`, as triples of ids. | Each file is sorted on the second column. |
| `edges_as_name_all.tsv` | File that contains all the edges of WordNet's semantic subgraph as triples of names (name synset 1, name relation, name synset 2). | The file is sorted on the second column. |
| `edges_as_name_*.tsv` | Files that contain only the edges of type `*`, as triples of names. | Each file is sorted on the second column. |
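As a minimal sketch of how these files might be consumed, the Python snippet below reads an id-to-name mapping and an edge file, then translates an id triple back into names. It assumes the files are tab-separated, have no header row, and use the column order described in the table. The helper names (`read_tsv`, `load_id_map`, `load_edges`, `decode_edge`) are hypothetical and not part of this project.

```python
import csv


def read_tsv(path):
    """Return the non-empty rows of a tab-separated file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters that may appear in synset names as-is.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]


def load_id_map(path):
    """Read a two-column (id, name) file into a dict keyed by integer id.

    Assumption: no header row, id in the first column, name in the second.
    """
    return {int(row[0]): row[1] for row in read_tsv(path)}


def load_edges(path):
    """Read (id synset 1, id relation, id synset 2) triples as integer tuples."""
    return [tuple(int(value) for value in row[:3]) for row in read_tsv(path)]


def decode_edge(edge, synset_names, relation_names):
    """Translate one id triple into a (name, relation, name) triple."""
    source, relation, target = edge
    return synset_names[source], relation_names[relation], synset_names[target]
```

With the dataset unpacked, `load_id_map("synset_id_to_name.tsv")`, `load_id_map("relation_id_to_name.tsv")` and `load_edges("edges_as_id_all.tsv")` would give the three inputs that `decode_edge` expects, provided the assumptions above hold for the actual files.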
A compressed version of the dataset can be downloaded from the release page or by clicking here.
The dataset is generated using nltk and is a subset of the WordNet dataset.
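The following is only a rough sketch of how such a subset could be extracted with nltk; it is not the project's actual generation script. It walks over synsets and collects the targets of synset-level relation methods, which skips the lemma-level (lexical) relations. The function names and the short `ILLUSTRATIVE_RELATIONS` tuple are assumptions made for illustration: the relation set really used for the dataset, and the order in which ids are assigned, are not stated here.

```python
# Assumption: a small illustrative subset of NLTK Synset relation methods,
# not the list of relations used to build this dataset.
ILLUSTRATIVE_RELATIONS = ("hypernyms", "hyponyms", "part_meronyms", "part_holonyms")


def semantic_edges(synsets, relation_methods=ILLUSTRATIVE_RELATIONS):
    """Yield (source name, relation, target name) triples.

    `synsets` is any iterable of objects exposing `name()` and one
    zero-argument method per entry in `relation_methods`, as NLTK's
    Synset objects do.
    """
    for synset in synsets:
        for relation in relation_methods:
            for target in getattr(synset, relation)():
                yield synset.name(), relation, target.name()


def assign_ids(names):
    """Map each distinct name to a numeric id starting from 0.

    Assumption: ids follow the sorted order of the names.
    """
    return {name: index for index, name in enumerate(sorted(set(names)))}


def wordnet_synsets():
    """Return all WordNet synsets (requires nltk and its 'wordnet' corpus)."""
    from nltk.corpus import wordnet as wn

    return list(wn.all_synsets())
```

Calling `list(semantic_edges(wordnet_synsets()))` would then produce name triples in the same shape as the rows of `edges_as_name_all.tsv`, and `assign_ids` could turn the names into ids for an id-based variant.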
All source code of this project is licensed under the MIT License - see the license file for details.