μKG is an open-source Python library for representation learning over knowledge graphs. μKG supports joint representation learning over multi-source knowledge graphs (and also a single knowledge graph), multiple deep learning libraries (PyTorch and TF2), multiple embedding tasks (link prediction, entity alignment, entity typing, and multi-source link prediction), and multiple parallel computing modes (multi-process and multi-GPU computing).
- Introduction of μKG 📃
- Getting Started 🚀
- Models hub 🏠
- Datasets hub 🏠
- Utils 📂
- Running Experiments 🔬
- License
- Citation
The basic framework of μKG is developed with Python, TensorFlow, and PyTorch, and Ray is used for distributed training. The software architecture is illustrated in the following figure.
Compared with other existing KG systems, μKG has the following competitive features.
👍Comprehensive. μKG is a full-featured Python library for representation learning over a single KG or multi-source KGs. It is compatible with the two widely-used deep learning libraries PyTorch and TensorFlow 2, and can therefore be easily integrated into downstream applications. It integrates a variety of KG embedding models and supports four KG tasks including link prediction, entity alignment, entity typing, and multi-source link prediction.
⚡Fast and scalable. μKG provides advanced implementations of KG embedding techniques with the support of multi-process and multi-GPU parallel computing, making it fast and scalable to large KGs.
🤳Easy-to-use. μKG provides simplified pipelines of KG embedding tasks for easy use. Users can interact with μKG with both method APIs and the command line. It also has high-quality documentation.
😀Continuously updated. Our team will keep up-to-date on new related techniques and integrate new (multi-source) KG embedding models, tasks, and datasets into μKG. We will also keep improving existing implementations.
μKG/
├── src/
│   ├── py/: a Python-based toolkit used for the upper layer of μKG
│   │   ├── data/: a collection of datasets used for knowledge graph reasoning
│   │   ├── args/: JSON files used for configuring hyperparameters of training process
│   │   ├── evaluation/: package of the implementations for supported downstream tasks
│   │   ├── load/: toolkit used for data loading and processing
│   │   ├── base/: package of the implementations for different initializers, losses and optimizers
│   │   └── util/: package of the implementations for checking virtual environment
│   ├── tf/: package of the implementations for KGE models, EA models and ET models in TensorFlow 2
│   └── torch/: package of the implementations for KGE models, EA models and ET models in PyTorch
μKG supports both the PyTorch and TensorFlow 2 deep learning libraries; you can choose either one according to your preference. The dependencies are:
- PyTorch 1.10.2 or TensorFlow 2.x
- Ray 1.12.0
- Scipy
- Numpy
- Igraph
- Pandas
- Scikit-learn
- Gensim
- Tqdm
We suggest creating a new conda environment first. We provide installation instructions for both tensorflow-gpu (tested on 2.3.0) and PyTorch (tested on 1.10.2). Note that Ray 1.10.0 and Ray 1.12.0 differ in batch generation; Ray 1.12.0 is used as the example here.
# commands for TensorFlow
conda create -n muKG python=3.8
conda activate muKG
conda install tensorflow-gpu==2.3.0
conda install -c conda-forge python-igraph
pip install -U ray==1.12.0
To install PyTorch, first install Anaconda and then follow the instructions on the PyTorch website. For example, if you are using CUDA version 11.3, use the following commands:
# commands for PyTorch
conda create -n muKG python=3.8
conda activate muKG
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install -c conda-forge python-igraph
pip install -U ray==1.12.0
The latest code can be installed with the following commands:
git clone https://github.com/nju-websoft/muKG.git muKG
cd muKG
pip install -e .
There are currently two ways to run μKG: from the command line, or by editing a Python file to configure your model. We provide a tutorial for each. The following example shows how to use μKG in Python. Here you can choose the task, select a specific model, and change the mode (training or evaluation). The hyperparameter files are stored in the `args` subfolder, which holds the complete settings for the training process.
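# load_args, read_kgs_from_folder, ea_models, kge_models and et_models are
# provided by μKG; import them from the package before running this snippet.
model_name = 'model name'      # name of the selected model
kg_task = 'selected KG task'   # 'lp', 'ea' or 'et'

# Load the hyperparameters of the selected task from the args subfolder.
if kg_task == 'ea':
    args = load_args("hyperparameter file folder of entity alignment task")
elif kg_task == 'lp':
    args = load_args("hyperparameter file folder of link prediction task")
else:
    args = load_args("hyperparameter file folder of entity typing task")

# Read the knowledge graph data.
kgs = read_kgs_from_folder()

# Pick the model family that matches the task.
if kg_task == 'ea':
    model = ea_models(args, kgs)
elif kg_task == 'lp':
    model = kge_models(args, kgs)
else:
    model = et_models(args, kgs)

model.get_model(model_name)  # select the specific model
model.run()                  # training
model.test()                 # evaluation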
To run a model on a dataset, use the following command line. The example below trains TransE on FB15K. The hyperparameters default to those in the corresponding JSON file in the `args_kge` folder.
# -t: task (lp, ea or et)  -m: selected model name  -o: mode (train and valid)  -d: selected dataset
python main_args.py -t lp -m transe -o train -d data/FB15K
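To make the role of each flag concrete, here is a minimal sketch that parses the same four options with `argparse`. It is not μKG's actual `main_args.py`: the attribute names (`task`, `model`, `mode`, `data`) and the restriction of `-t` to three choices are assumptions made for this illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags shown in the command above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Hypothetical μKG-style launcher")
    # -t: the KG task (link prediction, entity alignment, entity typing)
    parser.add_argument("-t", dest="task", choices=["lp", "ea", "et"], required=True)
    # -m: the name of the selected model
    parser.add_argument("-m", dest="model", required=True)
    # -o: the running mode
    parser.add_argument("-o", dest="mode", default="train")
    # -d: the path of the selected dataset
    parser.add_argument("-d", dest="data", required=True)
    return parser


# Parse the same arguments as the example command line.
args = build_parser().parse_args(
    ["-t", "lp", "-m", "transe", "-o", "train", "-d", "data/FB15K"]
)
print(args.task, args.model, args.mode, args.data)
```

A launcher built this way would then look up the hyperparameter file that matches `args.task` and `args.model` before constructing the model.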
μKG implements 26 KG models. The citation for each model refers to the paper describing it. We divide the models into three categories according to the downstream task: KGE models for link prediction, EA models for entity alignment, and ET models for entity typing. The three tables below list them in that order. You can add your own models under one of the three corresponding folders.
Name | Citation |
---|---|
TransE | Bordes et al., 2013 |
TransR | Lin et al., 2015 |
TransD | Ji et al., 2015 |
TransH | Wang et al., 2014 |
TuckER | Balažević et al., 2019 |
RotatE | Sun et al., 2019 |
SimplE | Kazemi et al., 2018 |
RESCAL | Nickel et al., 2011 |
ComplEx | Trouillon et al., 2016 |
Analogy | Liu et al., 2017 |
DistMult | Yang et al., 2014 |
HolE | Nickel et al., 2016 |
ConvE | Dettmers et al., 2018 |
Name | Citation |
---|---|
MTransE | Chen et al., 2017 |
IPTransE | Zhu et al., 2017 |
BootEA | Sun et al., 2018 |
JAPE | Sun et al., 2017 |
IMUSE | He et al., 2019 |
RDGCN | Wu et al., 2019 |
AttrE | Trisedya et al., 2019 |
SEA | Pei et al., 2019 |
GCN-Align | Wang et al., 2018 |
RSN4EA | Guo et al., 2019 |
Name | Citation |
---|---|
TransE | Bordes et al., 2013 |
RESCAL | Nickel et al., 2011 |
HolE | Nickel et al., 2016 |
μKG has 16 built-in KG datasets for different downstream tasks. Here we list the number of entities, relations, and training, validation, and test triples for these datasets. You can also prepare your own datasets in the datasets hub. First create a subfolder named after your dataset in the `data` folder, then put your train.txt, valid.txt and test.txt files in this folder. The data should be in triple format.
Dataset name | Entities | Relations | Train | Valid | Test | Citation |
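The steps above can be sketched in a few lines of Python. This is only an illustration: the source does not specify the separator, so the tab-separated `head relation tail` layout is an assumption, and the dataset name `MyKG`, the helpers `write_split` and `read_split`, and the temporary directory standing in for the `data` folder are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_split(path, triples):
    """Write one (head, relation, tail) triple per line, tab-separated (assumed layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for head, relation, tail in triples:
            f.write(f"{head}\t{relation}\t{tail}\n")


def read_split(path):
    """Read the triples back as a list of 3-tuples, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()]


# A temporary directory stands in for the data folder in this sketch.
data_root = Path(tempfile.mkdtemp()) / "data"
dataset_dir = data_root / "MyKG"  # subfolder named after the (hypothetical) dataset
dataset_dir.mkdir(parents=True)

splits = {
    "train.txt": [("Paris", "capital_of", "France"), ("Berlin", "capital_of", "Germany")],
    "valid.txt": [("Rome", "capital_of", "Italy")],
    "test.txt": [("Madrid", "capital_of", "Spain")],
}
for file_name, triples in splits.items():
    write_split(dataset_dir / file_name, triples)

loaded = read_split(dataset_dir / "train.txt")
print(sorted(p.name for p in dataset_dir.iterdir()), loaded[0])
```

Check the files of one of the built-in datasets to confirm the exact separator and column order before converting your own data.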
---|---|---|---|---|---|---|
FB15K | 14951 | 1345 | 483142 | 50000 | 59071 | Bordes et al., 2013 |
FB15K237 | 14541 | 237 | 272115 | 17535 | 20466 | Bordes et al., 2013 |
WN18RR | 40943 | 11 | 86835 | 3034 | 3134 | Toutanova et al., 2015 |
WN18 | 40943 | 18 | 141442 | 5000 | 5000 | Bordes et al., 2013 |
WN11 | 38588 | 11 | 112581 | 2609 | 10544 | Socher et al., 2013 |
DBpedia50 | 49900 | 654 | 23288 | 399 | 10969 | Shi et al., 2017 |
DBpedia500 | 517475 | 654 | 3102677 | 10000 | 1155937 | |
Countries | 271 | 2 | 1111 | 24 | 24 | Bouchard et al., 2015 |
FB13 | 75043 | 13 | 316232 | 5908 | 23733 | Socher et al., 2013 |
Kinship | 104 | 25 | 8544 | 1086 | 1074 | Kemp et al., 2006 |
Nations | 14 | 55 | 1592 | 199 | 201 | ZhenfengLei/KGDatasets |
NELL-995 | 75492 | 200 | 149678 | 543 | 3992 | Nathani et al., 2019 |
UMLS | 135 | 46 | 5216 | 652 | 661 | ZhenfengLei/KGDatasets |
Dataset name | Entities | Relations | Triples | Citation |
---|---|---|---|---|
OpenEA supported | 15000 | 248 | 38265 | Sun et al., 2020 |
Dataset name | Entities | Relations | Triples | Types | Citation |
---|---|---|---|---|---|
FB15K-ET | 15000 | 248 | 38265 | 3851 | Moon et al., 2017 |
Negative sampler:
μKG includes several negative sampling methods to randomly generate negative examples.
- Uniform negative sampling: This method replaces an entity in a triple or an alignment pair with another randomly-sampled entity to generate a negative example. It gives each entity the same replacement probability.
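- Self-adversarial negative sampling: This method samples negative triples according to the scores given by the current embedding model, so that negatives the model currently finds more plausible receive more weight.
- Truncated negative sampling: This method seeks to generate hard negative examples, that is, negatives that are difficult for the current model to distinguish from positive ones.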
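As a rough illustration of the first two methods, here is a minimal NumPy sketch. It is not μKG's sampler: the function names `uniform_negatives` and `self_adversarial_weights`, the 50/50 choice between corrupting the head and the tail, and the temperature parameter `alpha` are assumptions made for this example. The weighting follows the softmax form commonly used for self-adversarial sampling.

```python
import numpy as np


def uniform_negatives(triples, num_entities, rng):
    """Corrupt the head or the tail of each (h, r, t) with a uniformly sampled entity.

    Every entity has the same replacement probability. A real sampler would
    usually also filter out corruptions that reproduce a known positive triple.
    """
    triples = np.asarray(triples)
    negatives = triples.copy()
    corrupt_head = rng.random(len(triples)) < 0.5  # pick head or tail per triple
    random_entities = rng.integers(0, num_entities, size=len(triples))
    negatives[corrupt_head, 0] = random_entities[corrupt_head]
    negatives[~corrupt_head, 2] = random_entities[~corrupt_head]
    return negatives


def self_adversarial_weights(neg_scores, alpha=1.0):
    """Softmax over the scores of the negatives (higher score = more plausible).

    Negatives that the current model finds more plausible get larger weights.
    """
    z = alpha * np.asarray(neg_scores, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
positives = [(0, 0, 1), (2, 1, 3), (4, 0, 5)]
negatives = uniform_negatives(positives, num_entities=10, rng=rng)
weights = self_adversarial_weights([[-1.0, -2.0, -3.0]])
print(negatives)
print(weights)
```

In a training loop, the weights would multiply the per-negative loss terms instead of being used to draw samples, which avoids a second sampling step.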
Path sampler: The path sampler supports embedding models that are built by modeling the paths of KGs, such as IPTransE and RSN4EA. It can generate relational paths like (e_1, r_1, e_2, r_2, e_3), entity paths like (e_1, e_2, e_3), and relation paths like (r_1, r_2).
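The three kinds of paths can be illustrated with a short sketch that joins triples on their shared entity. This is an assumption-laden example rather than μKG's implementation: the function names are hypothetical and only two-hop paths are enumerated.

```python
from collections import defaultdict


def relational_paths(triples):
    """Enumerate two-hop relational paths (e1, r1, e2, r2, e3).

    Two triples (e1, r1, e2) and (e2, r2, e3) are joined on the shared entity e2.
    """
    out_edges = defaultdict(list)
    for head, relation, tail in triples:
        out_edges[head].append((relation, tail))
    paths = []
    for e1, r1, e2 in triples:
        for r2, e3 in out_edges[e2]:
            paths.append((e1, r1, e2, r2, e3))
    return paths


def entity_path(path):
    """Keep only the entities of a relational path: (e1, e2, e3)."""
    return path[0::2]


def relation_path(path):
    """Keep only the relations of a relational path: (r1, r2)."""
    return path[1::2]


triples = [("a", "r1", "b"), ("b", "r2", "c"), ("b", "r3", "d")]
paths = relational_paths(triples)
print(paths)
print(entity_path(paths[0]), relation_path(paths[0]))
```

Longer paths can be built the same way by repeatedly extending each path with the outgoing edges of its last entity.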
Subgraph sampler: The subgraph sampler supports GNN-based embedding models like GCN-Align and AliNet. It can generate both first-order (i.e., one-hop) and high-order (i.e., multi-hop) neighborhood subgraphs of entities.
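One-hop and multi-hop neighborhoods can be collected with a breadth-first search, as in the following sketch. It treats the KG as an undirected graph and ignores relation labels, which is a simplification chosen for this example. The names `build_adjacency` and `k_hop_neighbors` are hypothetical and not part of μKG.

```python
from collections import defaultdict, deque


def build_adjacency(triples):
    """Build an undirected entity adjacency map from (h, r, t) triples."""
    adjacency = defaultdict(set)
    for head, _, tail in triples:
        adjacency[head].add(tail)
        adjacency[tail].add(head)
    return adjacency


def k_hop_neighbors(adjacency, entity, k):
    """Return {neighbor: hop distance} for all entities within k hops of `entity`."""
    distance = {entity: 0}
    queue = deque([entity])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[entity]  # the entity itself is not its own neighbor
    return distance


triples = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r3", "d")]
adjacency = build_adjacency(triples)
one_hop = k_hop_neighbors(adjacency, "a", 1)
two_hop = k_hop_neighbors(adjacency, "a", 2)
print(one_hop, two_hop)
```

The hop distance stored for each neighbor makes it easy to separate first-order neighbors from higher-order ones when building the input of a GNN layer.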
(joint) link prediction & entity typing: This module is inspired by TorchKGE, a PyTorch-based library for efficient training and evaluation of KG embeddings. It uses an energy function to compute the plausibility of a candidate triple. The implemented metrics for assessing the performance of embedding tasks include Hits@K, mean rank (MR) and mean reciprocal rank (MRR). The hyperparameter JSON file stored in the `args` subfolder allows you to set the values of K for Hits@K.
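For reference, the three metrics can be computed from the ranks of the correct answers as in the sketch below. It is a minimal illustration, not μKG's evaluation code: the function names are hypothetical, and it assumes that a lower energy means a more plausible triple and that ties are resolved in favor of the target.

```python
import numpy as np


def rank_of_target(energies, target_index):
    """Rank (1 = best) of the target among all candidates.

    Assumes lower energy means higher plausibility; candidates with exactly
    the same energy as the target are not counted against it.
    """
    energies = np.asarray(energies, dtype=float)
    return int(np.sum(energies < energies[target_index])) + 1


def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute mean rank (MR), mean reciprocal rank (MRR) and Hits@K."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {
        "MR": float(ranks.mean()),
        "MRR": float((1.0 / ranks).mean()),
    }
    for k in ks:
        # Hits@K is the fraction of test cases whose target is ranked in the top K.
        metrics[f"Hits@{k}"] = float((ranks <= k).mean())
    return metrics


example_rank = rank_of_target([0.9, 0.2, 0.5], target_index=2)
metrics = ranking_metrics([1, 2, 4, 20])
print(example_rank, metrics)
```

For the ranks 1, 2, 4 and 20, this gives MR = 6.75 and MRR = 0.45, which shows how a single badly ranked case dominates MR while barely affecting MRR.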
entity alignment: This module provides several metrics to measure the similarity of entity embeddings, such as cosine similarity, inner product, Euclidean distance, and cross-domain similarity local scaling (CSLS). The evaluation process can be accelerated using multiprocessing.
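To show how CSLS differs from plain cosine similarity, here is a small NumPy sketch. It follows the usual CSLS definition, in which twice the cosine similarity is reduced by the mean similarity of each entity to its k nearest neighbors in the other KG. The function names and the default `k=2` are assumptions for this example, not μKG's API.

```python
import numpy as np


def cosine_matrix(source, target):
    """Pairwise cosine similarity between source and target embeddings."""
    source = source / np.linalg.norm(source, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    return source @ target.T


def csls_matrix(source, target, k=2):
    """CSLS(x, y) = 2 * cos(x, y) - r_target(x) - r_source(y).

    r_target(x) is the mean cosine similarity of x to its k nearest targets,
    and r_source(y) is the mean similarity of y to its k nearest sources.
    """
    sim = cosine_matrix(source, target)
    k_row = min(k, sim.shape[1])
    k_col = min(k, sim.shape[0])
    # After sorting in ascending order, the last k entries are the k nearest neighbors.
    r_row = np.sort(sim, axis=1)[:, -k_row:].mean(axis=1, keepdims=True)
    r_col = np.sort(sim, axis=0)[-k_col:, :].mean(axis=0, keepdims=True)
    return 2 * sim - r_row - r_col


source = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 1.1]])
cosine_alignment = cosine_matrix(source, target).argmax(axis=1)
csls_alignment = csls_matrix(source, target).argmax(axis=1)
print(cosine_alignment, csls_alignment)
```

The penalty terms lower the scores of "hub" entities that are close to many others, which is why CSLS is often preferred over raw cosine similarity for nearest-neighbor alignment.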
We use Ray to provide a uniform and easy-to-use interface for multi-GPU and multi-processing computation. The following figure shows our Ray-based implementation for parallel computing and the code snippet to use it. Users can set the number of CPUs or GPUs used for model training.
Use the following command line to train your model with multiple GPUs and multiple processes. First check the resources available on your machine (GPUs or CPUs), and then specify the number of parallel workers. The system automatically allocates resources to each worker.
# When you run on one or more GPUs, use os.environ['CUDA_VISIBLE_DEVICES'] to set GPU id list first
python main_args.py -t lp -m transe -o train -d data/FB15K -r gpu:2 -w 2
We provide the hyper-parameters of some models for the key experiments in the paper. These scripts can be found in the experiments folder. To reproduce an experiment, simply select the specific model in the corresponding Python file. We recommend checking the available GPU resources before running the efficiency experiments, and then adding the following code to set the GPU IDs for all Ray workers.
os.environ['CUDA_VISIBLE_DEVICES'] = "GPU IDs set"
We give the evaluation results of the efficiency of the proposed library μKG here. The experiments were conducted on a server with an Intel Xeon Gold 6240 2.6GHz CPU, 512GB of memory and four NVIDIA Tesla V100 GPUs. The following figure compares the training time of RotatE and ConvE on FB15K-237 when using different numbers of GPUs.
We further compare the training time used by μKG with LibKGE and PyKEEN. The backbone of μKG in this experiment is also PyTorch. We use the same hyper-parameter settings (e.g., batch size and maximum training epochs) for each model in the three libraries. The following table gives the training time of ConvE and RotatE on FB15K-237 with a single GPU for calculation.
Models | μKG | LibKGE | PyKEEN |
---|---|---|---|
RotatE | 639 s | 3,260 s | 1,085 s |
ConvE | 824 s | 1,801 s | 961 s |
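The relative speedups implied by the table can be worked out with a few lines of Python. The numbers are copied from the table above, and the dictionary layout is just a convenient choice for this example.

```python
# Training time in seconds, copied from the table above.
times = {
    "RotatE": {"μKG": 639, "LibKGE": 3260, "PyKEEN": 1085},
    "ConvE": {"μKG": 824, "LibKGE": 1801, "PyKEEN": 961},
}

# Speedup of μKG over each other library: other time divided by μKG time.
speedups = {
    model: {
        library: round(seconds / row["μKG"], 2)
        for library, seconds in row.items()
        if library != "μKG"
    }
    for model, row in times.items()
}

for model, row in speedups.items():
    for library, factor in row.items():
        print(f"{model}: μKG is {factor}x faster than {library}")
```

On this setup, μKG is about 5.1 times faster than LibKGE and 1.7 times faster than PyKEEN for RotatE, and about 2.19 and 1.17 times faster, respectively, for ConvE.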
This project is licensed under the GPL License. See the LICENSE file for details.
@inproceedings{muKG,
author = {Xindi Luo and
Zequn Sun and
Wei Hu},
title = {μKG: A Library for Multi-source Knowledge Graph Embeddings and Applications},
booktitle = {ISWC},
year = {2022}
}