HyperExpan: Taxonomy Expansion with Hyperbolic Representation Learning

Source code for paper HyperExpan: Taxonomy Expansion with Hyperbolic Representation Learning in Findings of ACL: EMNLP 2021.

[PDF] [Talk] [Slides] [Cite]

Dependencies

Create new conda environemnt

conda env create -f env.yml

Install DGL 0.4.0 version with GPU suppert using Conda from following page: https://www.dgl.ai/pages/start.html

conda activate hyperexpan
conda install dgl-cuda11.1 -c dglteam

Data Preparation

Download processed data

For dataset used in our paper, you can directly download all input files below and skip this section. We generate the initial feature vectors following TaxoExpan's implementation.

New taxonomy

For expanding new input taxonomies, you need to read this section and format your datasets accordingly.

Step 0.a (Required): Organize your input taxonomy along with node features into the following 3 files

1. <TAXONOMY_NAME>.terms, each line represents one concept in the taxonomy, including its ID and surface name

taxon1_id \t taxon1_surface_name
taxon2_id \t taxon2_surface_name
taxon3_id \t taxon3_surface_name
...

2. <TAXONOMY_NAME>.taxo, each line represents one relation in the taxonomy, including the parent taxon ID and child taxon ID

parent_taxon1_id \t child_taxon1_id
parent_taxon2_id \t child_taxon2_id
parent_taxon3_id \t child_taxon3_id
...

3. <TAXONOMY_NAME>.terms.<EMBED_SUFFIX>.embed, the first line indicates the vocabulary size and embedding dimension, each of the following line represents one taxon with its pretrained embedding

<VOCAB_SIZE> <EMBED_DIM>
taxon1_id taxon1_embedding
taxon2_id taxon2_embedding
taxon3_id taxon3_embedding
...

The embedding file follows the gensim word2vec format.

Notes:

Make sure the <TAXONOMY_NAME> is the same across all the 3 files.
The <EMBED_SUFFIX> is used to chooose what initial embedding you will use. You can leave it empty to load the file "<TAXONOMY_NAME>.terms.embed". Make sure you can generate the embedding for a new given term.

Step 0.b (Optional): Generate train/validation/test partition files

You can generate your desired train/validation/test parition files by creating another 3 separated files (named <TAXONOMY_NAME>.terms.train, <TAXONOMY_NAME>.terms.validation, as well as <TAXONOMY_NAME>.terms.test) and puting them in the same directory as the above three required files.

These three partition files are of the same format -- each line includes one taxon_id that appears in the above <TAXONOMY_NAME>.terms file.

Step 1: Generate the binary dataset file

Create a folder "./data/{DATASET_NAME}"
Put the above three required files (as well as three optional partition files) in "./data/{DATASET_NAME}"
Under this root directory, run

python generate_dataset_binary.py \
    --taxon_name <TAXONOMY_NAME> \
    --data_dir <DATASET_NAME> \
    --embed_suffix <EMBED_SUFFIX> \
    --existing_partition 0 \
    --partition_pattern leaf \

Such as:

python generate_dataset_binary.py \
    --taxon_name wordnet_verb \
    --data_dir data/SemEval-Verb \
    --embed_suffix fasttext_mode4_expan \
    --existing_partition 0 \
    --partition_pattern leaf

This script will first load the existing taxonomy (along with initial node features indicated by embed_suffix) from the previous three required files. Then, if existing_partition is 0, it will generate a random train/validation/test partitions, otherwise, it will load the existing train/validation/test partition files. Notice that if partition_pattern is internal, it will randomly sample both internal and leaf nodes for validation/test, which makes it a taxonomy completion task; if it is set leaf, it will become a taxonomy expansion task. Finally, it saves the generated dataset (along with all initial node features) in one pickle file for fast loading next time.

Model Training and Evaluation

Quick start

Start training and evaluation following the arguments set in a config file under config, this script will run both validation and testing along with training.

python train.py --config config/DATANAME.json

Key arguments in config files

mode: r, p and g, representing initial embedding, LSTM and GNN respectively. If you want to use a combination of initial embedding and GNN encoder, plz set mode to rg, and then the initial embedding and embedding output by GNN encoder will be concatenated for calculating matching score.
type: which model class to be used defined in model/model.py. Set it to ExpanMatchModel for taxonomy expansion, MatchModel for taxonomy completion.
trainer: which trainer class to be used defined in trainer/trainer.py.
hgcn_args/manifold: which hyperbolic model to be used. Options are: PoincareBall, Hyperboloid or Euclidean.
train_data_loader/type: which dataloader class to use, as defined in data_loader/data_loaders.py.
trainer/monitor: which metric to monitor to save the best epoch, max val_mrr_scaled_10 means select the checkpoint with highest validation mrr_scaled_10 result.

Acknowledgement

For all implementations, we follow the project organization in pytorch-template. Part of this codebase is adapted from previous works TMN and TaxoExpan.

Cite

@inproceedings{ma-etal-2021-hyperexpan-taxonomy,
    title = "{H}yper{E}xpan: Taxonomy Expansion with Hyperbolic Representation Learning",
    author = "Ma, Mingyu Derek  and
      Chen, Muhao  and
      Wu, Te-Lin  and
      Peng, Nanyun",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.353",
    doi = "10.18653/v1/2021.findings-emnlp.353",
    pages = "4182--4194",
    abstract = "Taxonomies are valuable resources for many applications, but the limited coverage due to the expensive manual curation process hinders their general applicability. Prior works attempt to automatically expand existing taxonomies to improve their coverage by learning concept embeddings in Euclidean space, while taxonomies, inherently hierarchical, more naturally align with the geometric properties of a hyperbolic space. In this paper, we present HyperExpan, a taxonomy expansion algorithm that seeks to preserve the structure of a taxonomy in a more expressive hyperbolic embedding space and learn to represent concepts and their relations with a Hyperbolic Graph Neural Network (HGNN). Specifically, HyperExpan leverages position embeddings to exploit the structure of the existing taxonomies, and characterizes the concept profile information to support the inference on new concepts that are unseen during training. Experiments show that our proposed HyperExpan outperforms baseline models with representation learning in a Euclidean feature space and achieves state-of-the-art performance on the taxonomy expansion benchmarks.",
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
base		base
config		config
data_loader		data_loader
hype		hype
logger		logger
model		model
trainer		trainer
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
closestedge.py		closestedge.py
closestneighbor.py		closestneighbor.py
env.yml		env.yml
generate_dataset_binary.py		generate_dataset_binary.py
infer.py		infer.py
parse_config.py		parse_config.py
run.sh		run.sh
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HyperExpan: Taxonomy Expansion with Hyperbolic Representation Learning

Dependencies

Data Preparation

Download processed data

New taxonomy

Step 0.a (Required): Organize your input taxonomy along with node features into the following 3 files

Step 0.b (Optional): Generate train/validation/test partition files

Step 1: Generate the binary dataset file

Model Training and Evaluation

Quick start

Key arguments in config files

Acknowledgement

Cite

About

Releases

Packages

Languages

License

PlusLabNLP/HyperExpan

Folders and files

Latest commit

History

Repository files navigation

HyperExpan: Taxonomy Expansion with Hyperbolic Representation Learning

Dependencies

Data Preparation

Download processed data

New taxonomy

Step 0.a (Required): Organize your input taxonomy along with node features into the following 3 files

Step 0.b (Optional): Generate train/validation/test partition files

Step 1: Generate the binary dataset file

Model Training and Evaluation

Quick start

Key arguments in config files

Acknowledgement

Cite

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages