μKG: A Library for Multi-source Knowledge Graph Embeddings and Applications, ISWC 2022

nju-websoft/muKG


μKG is an open-source Python library for representation learning over knowledge graphs. μKG supports joint representation learning over multi-source knowledge graphs (and also a single knowledge graph), multiple deep learning libraries (PyTorch and TF2), multiple embedding tasks (link prediction, entity alignment, entity typing, and multi-source link prediction), and multiple parallel computing modes (multi-process and multi-GPU computing).

Table of contents

  1. Introduction of μKG 📃
    1. Overview
    2. Package Description
  2. Getting Started 🚀
    1. Dependencies
    2. Installation
    3. Usage
  3. Models hub 🏠
    1. KGE models
    2. EA models
    3. ET models
  4. Datasets hub 🏠
    1. KGE datasets
    2. EA datasets
    3. ET datasets
  5. Utils 📂
    1. Sampler
    2. Evaluator
    3. Multi-GPU and multi-processing computation
  6. Running Experiments 🔬
  7. License
  8. Citation

Introduction of μKG 📃

Overview

We use Python, TensorFlow and PyTorch to develop the basic framework of μKG, and Ray for distributed training. The software architecture is illustrated in the following figure.

[Figure: software architecture of μKG]

Compared with other existing KG systems, μKG has the following competitive features.

👍Comprehensive. μKG is a full-featured Python library for representation learning over a single KG or multi-source KGs. It is compatible with the two widely-used deep learning libraries PyTorch and TensorFlow 2, and can therefore be easily integrated into downstream applications. It integrates a variety of KG embedding models and supports four KG tasks including link prediction, entity alignment, entity typing, and multi-source link prediction.

Fast and scalable. μKG provides advanced implementations of KG embedding techniques with the support of multi-process and multi-GPU parallel computing, making it fast and scalable to large KGs.

🤳Easy-to-use. μKG provides simplified pipelines of KG embedding tasks for easy use. Users can interact with μKG with both method APIs and the command line. It also has high-quality documentation.

😀Continuously updated. Our team will keep up-to-date on new related techniques and integrate new (multi-source) KG embedding models, tasks, and datasets into μKG. We will also keep improving existing implementations.

Package Description

μKG/
├── src/
│   ├── py/: a Python-based toolkit used for the upper layer of μKG
│   │   ├── data/: a collection of datasets used for knowledge graph reasoning
│   │   ├── args/: JSON files used for configuring hyperparameters of the training process
│   │   ├── evaluation/: package of the implementations for supported downstream tasks
│   │   ├── load/: toolkit used for data loading and processing
│   │   ├── base/: package of the implementations for different initializers, losses and optimizers
│   │   └── util/: package of the implementations for checking virtual environment
│   ├── tf/: package of the implementations for KGE models, EA models and ET models in TensorFlow 2
│   └── torch/: package of the implementations for KGE models, EA models and ET models in PyTorch

Getting Started 🚀

Dependencies (Python 3)

μKG supports the PyTorch and TensorFlow 2 deep learning libraries. Users can choose either one according to their preference; the remaining dependencies are the same in both cases.

  • PyTorch 1.10.2 or TensorFlow 2.x
  • Ray 1.12.0
  • Scipy
  • Numpy
  • Igraph
  • Pandas
  • Scikit-learn
  • Gensim
  • Tqdm

Installation 🔧

We suggest creating a new conda environment first. We provide installation instructions for tensorflow-gpu (tested on 2.3.0) and PyTorch (tested on 1.10.2). Note that Ray 1.10.0 and Ray 1.12.0 differ in batch generation; the commands below use Ray 1.12.0.

# command for Tensorflow
conda create -n muKG python=3.8
conda activate muKG
conda install tensorflow-gpu==2.3.0
conda install -c conda-forge python-igraph
pip install -U ray==1.12.0

To install PyTorch, you must install Anaconda and follow the instructions on the PyTorch website. For example, if you’re using CUDA version 11.3, use the following command:

# command for PyTorch
conda create -n muKG python=3.8
conda activate muKG
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install -c conda-forge python-igraph
pip install -U ray==1.12.0

The latest code can be installed with the following commands:

git clone https://github.com/nju-websoft/muKG.git muKG
cd muKG
pip install -e .

Usage 📝

Currently, there are two ways to use μKG: editing a Python file to configure your model, or using the command line. The following example shows how to use μKG in Python. Here you can choose the task, select a specific model and change the mode (training or evaluation). The hyperparameter files are stored in the args subfolder and hold the complete details of the training process.

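# The quoted strings are placeholders; replace them with real values.
# load_args, read_kgs_from_folder, ea_models, kge_models and et_models are
# provided by μKG and must be imported before running this script.
model_name = 'model name'
kg_task = 'selected KG task'  # 'ea', 'lp' or 'et'

# Load the hyperparameters for the selected task.
if kg_task == 'ea':
    args = load_args("hyperparameter file folder of entity alignment task")
elif kg_task == 'lp':
    args = load_args("hyperparameter file folder of link prediction task")
else:
    args = load_args("hyperparameter file folder of entity typing task")

# Read the knowledge graphs.
kgs = read_kgs_from_folder()

# Build the model wrapper that matches the task.
if kg_task == 'ea':
    model = ea_models(args, kgs)
elif kg_task == 'lp':
    model = kge_models(args, kgs)
else:
    model = et_models(args, kgs)

model.get_model(model_name)  # select the specific model
model.run()                  # train the model
model.test()                 # evaluate the model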

To run a model on a dataset from the command line, use the following command. We show an example of training TransE on FB15K here. The hyperparameters default to the corresponding JSON file in the args_kge folder.

# -t: task (lp, ea or et); -m: selected model name; -o: mode (train and valid); -d: selected dataset
python main_args.py -t lp -m transe -o train -d data/FB15K

Models hub 🏠

μKG implements 26 KG models. The citation for each model refers to the paper describing it. We divide the models into three categories according to their downstream knowledge graph tasks. You can add your own models under one of the three folders.

KGE models

| Name | Citation |
| --- | --- |
| TransE | Bordes et al., 2013 |
| TransR | Lin et al., 2015 |
| TransD | Ji et al., 2015 |
| TransH | Wang et al., 2014 |
| TuckER | Balažević et al., 2019 |
| RotatE | Sun et al., 2019 |
| SimplE | Kazemi et al., 2018 |
| RESCAL | Nickel et al., 2011 |
| ComplEx | Trouillon et al., 2016 |
| Analogy | Liu et al., 2017 |
| DistMult | Yang et al., 2014 |
| HolE | Nickel et al., 2016 |
| ConvE | Dettmers et al., 2018 |

EA models

| Name | Citation |
| --- | --- |
| MTransE | Chen et al., 2017 |
| IPTransE | Zhu et al., 2017 |
| BootEA | Sun et al., 2018 |
| JAPE | Sun et al., 2017 |
| IMUSE | He et al., 2019 |
| RDGCN | Wu et al., 2019 |
| AttrE | Trisedya et al., 2019 |
| SEA | Pei et al., 2019 |
| GCN-Align | Wang et al., 2018 |
| RSN4EA | Guo et al., 2019 |

ET models

| Name | Citation |
| --- | --- |
| TransE | Bordes et al., 2013 |
| RESCAL | Nickel et al., 2011 |
| HolE | Nickel et al., 2016 |

Datasets hub 🏠

μKG has 16 built-in KG datasets for different downstream tasks. Here we list the number of entities, relations, train triples, valid triples and test triples for these datasets. You can also prepare your own datasets. First, create a subfolder named after your dataset in the data folder, then put your train.txt, valid.txt and test.txt files in it. The data should be in the triple format.
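The folder layout described above can be sketched as follows. This is only an illustration: the helper names `write_dataset` and `read_triples` are hypothetical, and the tab-separated `head relation tail` line format is an assumption, since the text only says "triple format". Check an existing dataset folder for the exact separator before relying on it.

```python
import os
import tempfile


def write_dataset(folder, splits):
    """Write train.txt, valid.txt and test.txt into `folder`.

    `splits` maps a split name to a list of (head, relation, tail) tuples.
    Assumption: one tab-separated triple per line.
    """
    os.makedirs(folder, exist_ok=True)
    for name, triples in splits.items():
        path = os.path.join(folder, f"{name}.txt")
        with open(path, "w", encoding="utf-8") as f:
            for head, rel, tail in triples:
                f.write(f"{head}\t{rel}\t{tail}\n")


def read_triples(path, sep="\t"):
    """Read one (head, relation, tail) triple per non-empty line."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            parts = line.split(sep)
            if len(parts) != 3:
                raise ValueError(f"{path}:{line_no}: expected 3 fields, got {len(parts)}")
            triples.append(tuple(parts))
    return triples


if __name__ == "__main__":
    # Build a toy dataset under a temporary "data/<dataset name>" folder.
    with tempfile.TemporaryDirectory() as root:
        folder = os.path.join(root, "data", "toy")
        write_dataset(folder, {
            "train": [("Paris", "capital_of", "France")],
            "valid": [("Rome", "capital_of", "Italy")],
            "test": [("Berlin", "capital_of", "Germany")],
        })
        print(read_triples(os.path.join(folder, "train.txt")))
```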

KGE datasets

| Dataset | Entities | Relations | Train | Valid | Test | Citation |
| --- | --- | --- | --- | --- | --- | --- |
| FB15K | 14951 | 1345 | 483142 | 50000 | 59071 | Bordes et al., 2013 |
| FB15K237 | 14541 | 237 | 272115 | 17535 | 20466 | Bordes et al., 2013 |
| WN18RR | 40943 | 11 | 86835 | 3034 | 3134 | Toutanova et al., 2015 |
| WN18 | 40943 | 18 | 141442 | 5000 | 5000 | Bordes et al., 2013 |
| WN11 | 38588 | 11 | 112581 | 2609 | 10544 | Socher et al., 2013 |
| DBpedia50 | 49900 | 654 | 23288 | 399 | 10969 | Shi et al., 2017 |
| DBpedia500 | 517475 | 654 | 3102677 | 10000 | 1155937 | |
| Countries | 271 | 2 | 1111 | 24 | 24 | Bouchard et al., 2015 |
| FB13 | 75043 | 13 | 316232 | 5908 | 23733 | Socher et al., 2013 |
| Kinship | 104 | 25 | 8544 | 1086 | 1074 | Kemp et al., 2006 |
| Nations | 14 | 55 | 1592 | 199 | 201 | ZhenfengLei/KGDatasets |
| NELL-995 | 75492 | 200 | 149678 | 543 | 3992 | Nathani et al., 2019 |
| UMLS | 75492 | 135 | 5216 | 652 | 661 | ZhenfengLei/KGDatasets |

EA datasets

| Dataset | Entities | Relations | Triples | Citation |
| --- | --- | --- | --- | --- |
| OpenEA supported | 15000 | 248 | 38265 | Sun et al., 2020 |

ET datasets

| Dataset | Entities | Relations | Triples | Types | Citation |
| --- | --- | --- | --- | --- | --- |
| FB15K-ET | 15000 | 248 | 38265 | 3851 | Moon et al., 2017 |

Utils 📂

Sampler

Negative sampler:

μKG includes several negative sampling methods to randomly generate negative examples.

  • Uniform negative sampling: This method replaces an entity in a triple or an alignment pair with another randomly-sampled entity to generate a negative example. It gives each entity the same replacement probability.
  • Self-adversarial negative sampling: This method samples negative triples according to the current embedding model.
  • Truncated negative sampling: This method seeks to generate hard negative examples.
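The three strategies above can be sketched in plain Python. This is a minimal illustration, not μKG's implementation: all function names are hypothetical. The self-adversarial weights follow the softmax weighting proposed with RotatE (Sun et al., 2019), where a higher score means a more plausible triple. The truncated variant assumes that hard negatives are drawn from an entity's nearest neighbours in embedding space, as in BootEA; the text above does not specify how μKG selects them.

```python
import math
import random


def uniform_negatives(triple, entities, num_neg, known_triples, rng, max_tries=1000):
    """Corrupt the head or the tail with an entity drawn uniformly at random.

    `known_triples` should contain the positive triple itself so that the
    sampler never returns a true triple as a negative example.
    """
    head, rel, tail = triple
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == num_neg:
            break
        entity = rng.choice(entities)  # every entity has the same probability
        candidate = (entity, rel, tail) if rng.random() < 0.5 else (head, rel, entity)
        if candidate not in known_triples:
            negatives.append(candidate)
    return negatives


def self_adversarial_weights(scores, alpha=1.0):
    """Softmax over the model's scores of the negatives (higher = more plausible).

    Negatives that the current model finds plausible get a larger weight.
    """
    top = max(scores)
    exps = [math.exp(alpha * (s - top)) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def truncated_candidates(entity, embeddings, k):
    """Return the k entities closest to `entity` in embedding space.

    Sampling replacements only from this set yields hard negatives.
    """
    anchor = embeddings[entity]
    others = [e for e in embeddings if e != entity]
    others.sort(key=lambda e: math.dist(anchor, embeddings[e]))
    return others[:k]
```

In a training loop, the weights from `self_adversarial_weights` would multiply the per-negative loss terms, and `truncated_candidates` would be recomputed periodically as the embeddings change.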

Path sampler: The path sampler supports embedding models that are built by modeling the paths of KGs, such as IPTransE and RSN4EA. It can generate relational paths like (e_1, r_1, e_2, r_2, e_3), entity paths like (e_1, e_2, e_3), and relation paths like (r_1, r_2).
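The three path types can be illustrated with a short random-walk sketch. The function names are hypothetical and the walk follows edges in the head-to-tail direction only, which is an assumption; μKG's own sampler may differ.

```python
import random
from collections import defaultdict


def build_adjacency(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    adjacency = defaultdict(list)
    for head, rel, tail in triples:
        adjacency[head].append((rel, tail))
    return adjacency


def sample_relational_path(adjacency, start, hops, rng):
    """Random walk of `hops` edges: (e_1, r_1, e_2, ..., r_hops, e_{hops+1}).

    Returns None if the walk reaches an entity without outgoing edges.
    """
    path = [start]
    current = start
    for _ in range(hops):
        edges = adjacency.get(current)
        if not edges:
            return None
        rel, tail = rng.choice(edges)
        path += [rel, tail]
        current = tail
    return tuple(path)


def entity_path(relational_path):
    """Keep the entities only: (e_1, e_2, e_3)."""
    return relational_path[0::2]


def relation_path(relational_path):
    """Keep the relations only: (r_1, r_2)."""
    return relational_path[1::2]


if __name__ == "__main__":
    triples = [("e1", "r1", "e2"), ("e2", "r2", "e3")]
    path = sample_relational_path(build_adjacency(triples), "e1", 2, random.Random(0))
    print(path, entity_path(path), relation_path(path))
```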

Subgraph sampler: The subgraph sampler supports GNN-based embedding models like GCN-Align and AliNet. It can generate both first-order (i.e., one-hop) and high-order (i.e., multi-hop) neighborhood subgraphs of entities.
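One-hop and multi-hop neighborhoods can be collected with a breadth-first search, as in the sketch below. The function name `k_hop_neighbors` is hypothetical, and treating every triple as an undirected edge is an assumption made for the illustration.

```python
from collections import defaultdict, deque


def k_hop_neighbors(triples, entity, k):
    """Return {neighbor: hop distance} for all entities within k hops of `entity`.

    Relations are ignored and edges are treated as undirected (assumption).
    """
    neighbors = defaultdict(set)
    for head, _, tail in triples:
        neighbors[head].add(tail)
        neighbors[tail].add(head)

    distance = {entity: 0}
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        if distance[current] == k:  # do not expand beyond k hops
            continue
        for nxt in neighbors[current]:
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)

    del distance[entity]  # the entity is not its own neighbor
    return distance
```

With `k=1` this gives the first-order neighborhood; larger values of `k` give the high-order neighborhoods, and the hop distance can be used to build one subgraph per order.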

Evaluator

(Joint) link prediction & entity typing: This module is inspired by TorchKGE, a PyTorch-based library for efficient training and evaluation of KG embedding. It uses the energy function to compute the plausibility of a candidate triple. The implemented metrics for assessing the performance of embedding tasks include Hits@K, mean rank (MR) and mean reciprocal rank (MRR). The hyperparameter JSON file stored in the args subfolder lets you set the values of K for Hits@K.
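The three metrics are simple functions of the rank of the correct entity among all candidates. The sketch below uses hypothetical function names and assumes that a lower energy means a more plausible triple; it does not filter out other known true triples, which real evaluations often do.

```python
def rank_of_true(energies, true_index):
    """Rank of the correct candidate, where lower energy is more plausible.

    The rank is 1 plus the number of candidates with strictly lower energy.
    """
    true_energy = energies[true_index]
    return 1 + sum(e < true_energy for e in energies)


def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR and Hits@K from a list of 1-based ranks."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,                    # mean rank, lower is better
        "MRR": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank, higher is better
    }
    for k in ks:
        metrics[f"Hits@{k}"] = sum(r <= k for r in ranks) / n
    return metrics


if __name__ == "__main__":
    print(rank_of_true([0.3, 0.1, 0.7], true_index=0))
    print(ranking_metrics([1, 2, 10, 50]))
```

For the ranks `[1, 2, 10, 50]`, MR is 15.75, Hits@1 is 0.25, Hits@3 is 0.5 and Hits@10 is 0.75.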

Entity alignment: It provides several metrics to measure entity embedding similarities, such as cosine similarity, inner product, Euclidean distance, and cross-domain similarity local scaling (CSLS). The evaluation process can be accelerated using multiprocessing.
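The four similarity measures can be written down in a few lines. The sketch below is a plain-Python illustration with hypothetical function names, not μKG's implementation; a real evaluator would use matrix operations. CSLS is computed as 2·cos(x, y) minus the mean cosine similarity of x to its k nearest target neighbours and of y to its k nearest source neighbours. Vectors are assumed to be non-zero.

```python
import math


def inner(u, v):
    """Inner (dot) product; larger means more similar."""
    return sum(a * b for a, b in zip(u, v))


def euclidean(u, v):
    """Euclidean distance; smaller means more similar."""
    return math.dist(u, v)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))


def csls_matrix(source, target, k=1):
    """CSLS scores between every source and target embedding."""
    sim = [[cosine(s, t) for t in target] for s in source]

    def mean_top_k(values):
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top)

    r_source = [mean_top_k(row) for row in sim]        # density around each source entity
    r_target = [mean_top_k(col) for col in zip(*sim)]  # density around each target entity
    return [
        [2 * sim[i][j] - r_source[i] - r_target[j] for j in range(len(target))]
        for i in range(len(source))
    ]
```

For each source entity, the predicted counterpart is the target entity with the highest score in the corresponding row.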

Multi-GPU and multi-processing computation

We use Ray to provide a uniform and easy-to-use interface for multi-GPU and multi-processing computation. The following figure shows our Ray-based implementation for parallel computing and the code snippet to use it. Users can set the number of CPUs or GPUs used for model training.

[Figure: Ray-based implementation for parallel computing and the code snippet to use it]

Use the following command line to train your model with multi-GPU and multi-processing computation. First check the resources available on your machine (GPUs or CPUs), then specify the number of parallel workers. The system will automatically allocate resources to each worker running in parallel.

# When you run on one or more GPUs, use os.environ['CUDA_VISIBLE_DEVICES'] to set GPU id list first 
python main_args.py -t lp -m transe -o train -d data/FB15K -r gpu:2 -w 2  
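The idea of dividing training batches among a fixed number of parallel workers can be illustrated without Ray. The sketch below deliberately uses only the standard library (a thread pool) as a stand-in, so it is not how μKG schedules work; the function names are hypothetical and `train_on_shard` merely counts triples instead of training.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_batches(batches, num_workers):
    """Split a list of batches into `num_workers` shards, round-robin."""
    return [batches[i::num_workers] for i in range(num_workers)]


def train_on_shard(worker_id, shard):
    """Stand-in for a worker's training loop: count the triples it would process."""
    return worker_id, sum(len(batch) for batch in shard)


def run_parallel(batches, num_workers):
    """Run one worker per shard and collect {worker_id: processed triples}."""
    shards = shard_batches(batches, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(train_on_shard, i, shard) for i, shard in enumerate(shards)]
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    toy_batches = [[1, 2], [3], [4, 5, 6], [7]]
    print(run_parallel(toy_batches, num_workers=2))
```

With two workers, the first worker receives batches 0 and 2 and the second receives batches 1 and 3, mirroring the `-w 2` setting in the command above.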

Running Experiments 🔬

Instruction

We provide the hyper-parameters of some models for the critical experiments in the paper. These scripts can be found in the experiments folder. To reproduce an experiment, select the specific model in the corresponding Python file. We recommend checking the available GPU resources before running the efficiency experiments, and then adding the following code to set the GPU IDs for all Ray workers.

import os

# Replace the placeholder with a comma-separated list of GPU IDs, e.g. "0,1".
os.environ['CUDA_VISIBLE_DEVICES'] = "GPU IDs set"

Efficiency of multi-GPU training

We give the evaluation results of the efficiency of the proposed library μKG here. The experiments were conducted on a server with an Intel Xeon Gold 6240 2.6GHz CPU, 512GB of memory and four NVIDIA Tesla V100 GPUs. The following figure compares the training time of RotatE and ConvE on FB15K-237 when using different numbers of GPUs.

[Figure: training time of RotatE and ConvE on FB15K-237 with different numbers of GPUs]

Training time comparison of different libraries

We further compare the training time of μKG with that of LibKGE and PyKEEN. The backbone of μKG in this experiment is also PyTorch. We use the same hyper-parameter settings (e.g., batch size and maximum training epochs) for each model in the three libraries. The following table gives the training time of ConvE and RotatE on FB15K-237 with a single GPU.

| Models | μKG | LibKGE | PyKEEN |
| --- | --- | --- | --- |
| RotatE | 639 s | 3,260 s | 1,085 s |
| ConvE | 824 s | 1,801 s | 961 s |
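The reported times can be turned into relative training times with a few lines of arithmetic. The snippet below only restates the numbers from the table; the function name `slowdown_vs` is hypothetical.

```python
# Training times in seconds, copied from the table above.
TRAINING_TIME = {
    "RotatE": {"μKG": 639, "LibKGE": 3260, "PyKEEN": 1085},
    "ConvE": {"μKG": 824, "LibKGE": 1801, "PyKEEN": 961},
}


def slowdown_vs(times, baseline="μKG"):
    """For each model, divide every other library's time by the baseline's time."""
    return {
        model: {lib: t / row[baseline] for lib, t in row.items() if lib != baseline}
        for model, row in times.items()
    }


if __name__ == "__main__":
    for model, ratios in slowdown_vs(TRAINING_TIME).items():
        for lib, ratio in ratios.items():
            print(f"{model}: {lib} takes {ratio:.2f}x the time of μKG")
```

For RotatE, LibKGE and PyKEEN take about 5.1 and 1.7 times as long as μKG; for ConvE, the ratios are about 2.2 and 1.2.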

License

This project is licensed under the GPL License; see the LICENSE file for details.

Citation

@inproceedings{muKG,
  author    = {Xindi Luo and
               Zequn Sun and
               Wei Hu},
  title     = {μKG: A Library for Multi-source Knowledge Graph Embeddings and Applications},
  booktitle = {ISWC},
  year      = {2022}
}