Zhangyang Gao*, Daize Dong*, Cheng Tan, Jun Xia, Bozhen Hu, Stan Z. Li
Published at the 41st International Conference on Machine Learning (ICML 2024).
Can we model Non-Euclidean graphs as pure language, or even as Euclidean vectors, while retaining their inherent information? The Non-Euclidean property has posed a long-standing challenge in graph modeling. Despite recent efforts by graph neural networks and graph transformers to encode graphs as Euclidean vectors, recovering the original graph from those vectors remains a challenge. In this paper, we introduce GraphsGPT, featuring a Graph2Seq encoder that transforms Non-Euclidean graphs into learnable GraphWords in Euclidean space, along with a GraphGPT decoder that reconstructs the original graph from GraphWords to ensure information equivalence. We pretrain GraphsGPT on $100$M molecules and obtain some interesting findings:
- The pretrained Graph2Seq excels at graph representation learning, achieving state-of-the-art results on $8/9$ graph classification and regression tasks.
- The pretrained GraphGPT serves as a strong graph generator, demonstrated by its ability to perform both few-shot and conditional graph generation.
- Graph2Seq+GraphGPT enables effective graph mixup in the Euclidean space, overcoming previously known Non-Euclidean challenges.
- The edge-centric pretraining framework GraphsGPT demonstrates its efficacy in graph domain tasks, excelling in both representation and generation.
This is the official code implementation of the ICML 2024 paper *A Graph is Worth $K$ Words: Euclideanizing Graph using Pure Transformer*.
The model checkpoints can be downloaded from 🤗 Transformers. We provide both the foundation pretrained models with different numbers of Graph Words, and the finetuned model for conditional generation.
| Model Name | Model Type | Model Checkpoint |
|---|---|---|
| GraphsGPT-1W | Foundation Model | |
| GraphsGPT-2W | Foundation Model | |
| GraphsGPT-4W | Foundation Model | |
| GraphsGPT-8W | Foundation Model | |
| GraphsGPT-1W-C | Finetuned Model | |
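
The checkpoints can be loaded directly with the 🤗 `transformers` library. Below is a minimal sketch, assuming the checkpoint is hosted under a repository id such as `DaizeDong/GraphsGPT-1W` (a placeholder; use the actual id behind the checkpoint link in the table above):

```python
# Minimal sketch: load a pretrained GraphsGPT checkpoint from the 🤗 Hub.
# The repository id below is an assumption -- replace it with the id behind
# the "Model Checkpoint" link in the table above.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "DaizeDong/GraphsGPT-1W",  # hypothetical repo id
    trust_remote_code=True,    # the checkpoint ships custom model code
)
model.eval()
print(model.config)
```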
To get started with GraphsGPT, please run the following commands to set up the environment.
```bash
git clone git@github.com:A4Bio/GraphsGPT.git
cd GraphsGPT
conda create --name graphsgpt python=3.12
conda activate graphsgpt
pip install -e .[dev]
pip install -r requirement.txt
```
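
As a quick sanity check after installation (assuming `torch` and `transformers` are pulled in by `requirement.txt` or the editable install), you can verify that the core dependencies import correctly:

```python
# Quick environment check: the exact dependency list lives in requirement.txt,
# but torch and transformers are the packages the examples in this README rely on.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```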
We provide some Jupyter Notebooks in `./jupyter_notebooks`, along with their corresponding online Google Colaboratory Notebooks. You can run them for a quick start.
| Example Name | Jupyter Notebook | Google Colaboratory |
|---|---|---|
| GraphsGPT Pipeline | example_pipeline.ipynb | |
| Graph Clustering Analysis | clustering.ipynb | |
| Graph Hybridization Analysis | hybridization.ipynb | |
| Graph Interpolation Analysis | interpolation.ipynb | |
You should first download the configurations and data for finetuning and put them in `./data_finetune`. (We also include the finetuned checkpoints in the `model_zoom.zip` file for a quick test.)
To evaluate the representation performance of Graph2Seq Encoder, please run:
```bash
bash ./scripts/representation/finetune.sh
```
You can also set the `--mixup_strategy` option to perform graph mixup with Graph2Seq.
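
Because Graph2Seq maps each graph to a fixed set of GraphWords in Euclidean space, graph mixup reduces to a linear interpolation between the encodings of two molecules. The sketch below illustrates the idea only; the tensor shapes are assumptions, and in practice `graph_words_a`/`graph_words_b` would come from the pretrained encoder rather than random initialization:

```python
import torch

# Stand-ins for the Graph2Seq outputs of two molecules.
# Shapes are assumptions: K Graph Words, each of hidden width d.
K, d = 8, 512
graph_words_a = torch.randn(K, d)
graph_words_b = torch.randn(K, d)

# Mixup in the Euclidean GraphWord space is plain linear interpolation.
lam = 0.5  # mixing ratio, as in standard mixup
graph_words_mixed = lam * graph_words_a + (1.0 - lam) * graph_words_b

# The mixed GraphWords can then be decoded by GraphGPT into a new molecule,
# or used as an augmented representation during finetuning.
print(graph_words_mixed.shape)  # torch.Size([8, 512])
```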
For unconditional generation with GraphGPT Decoder, please refer to README-Generation-Uncond.md.
For conditional generation with GraphGPT-C Decoder, please refer to README-Generation-Cond.md.
To evaluate the few-shot generation performance of the GraphGPT Decoder, please run:
```bash
bash ./scripts/generation/evaluation/moses.sh
bash ./scripts/generation/evaluation/zinc250k.sh
```
```bibtex
@article{gao2024graph,
  title={A Graph is Worth $K$ Words: Euclideanizing Graph using Pure Transformer},
  author={Gao, Zhangyang and Dong, Daize and Tan, Cheng and Xia, Jun and Hu, Bozhen and Li, Stan Z},
  journal={arXiv preprint arXiv:2402.02464},
  year={2024}
}
```
If you have any questions, please contact:
- Zhangyang Gao: gaozhangyang@westlake.edu.cn
- Daize Dong: dzdong2019@gmail.com