This repo contains our code submission for the OGB challenge. We focus on ogbg-molhiv, a binary classification task that predicts a target molecular property, namely whether or not a molecule inhibits HIV replication. The evaluation metric is AUROC. To the best of our knowledge, this is the first solution to directly optimize the AUC score on this task. Our AUC-Margin loss improves the baseline (DeepGCN) to 0.8159 test AUROC and achieves SOTA performance of 0.8352 when trained jointly with Neural FingerPrints. Our approach is implemented in LibAUC, an ML library for AUC optimization.
Our method ranked 1st on the leaderboard as of 10/11/2021! Our results on the ogbg-molhiv dataset, alongside several strong baselines, are shown below:
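For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen positive molecule is scored above a randomly chosen negative one. The following is a minimal pure-Python sketch of that pairwise definition; the function name `auroc` is illustrative and is not part of this repo or of the OGB evaluator.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of AUROC; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # 3 of the 4 positive/negative pairs are ordered correctly.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This quadratic-time version is only meant to make the definition concrete; for real evaluation, use the official OGB evaluator.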
Method | Test AUROC | Validation AUROC | Parameters | Hardware |
---|---|---|---|---|
DeepGCN | 0.7858±0.0117 | 0.8427±0.0063 | 531,976 | Tesla V100 (32GB) |
DeeperGCN+FLAG | 0.7942±0.0120 | 0.8425±0.0061 | 531,976 | Tesla V100 (32GB) |
Neural FingerPrints | 0.8232±0.0047 | 0.8331±0.0054 | 2,425,102 | Tesla V100 (32GB) |
Graphormer | 0.8051±0.0053 | 0.8310±0.0089 | 47,183,040 | Tesla V100 (16GB) |
DeepAUC (Ours) | 0.8159±0.0059 | 0.8054±0.0080 | 1,019,407 | Tesla V100 (32GB) |
DeepAUC+FPs (Ours) | 0.8352±0.0054 | 0.8238±0.0061 | 1,019,407** | Tesla V100 (32GB) |
- Note that the number marked with ** does not include the parameters of the Random Forest model.
- Install base packages:
Python>=3.7 Pytorch>=1.9.0 tensorflow>=2.0.0 pytorch_geometric>=1.6.0 ogb>=1.3.2 dgl>=0.5.3 numpy==1.20.3 pandas==1.2.5 scikit-learn==0.24.2 deep_gcns_torch
- Install LibAUC (using AUC-Margin loss and PESG optimizer):
pip install LibAUC
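To give a rough idea of what the AUC-Margin loss measures, here is a minimal pure-Python sketch of the min-max objective as we understand it from the paper cited at the end of this README, evaluated on a single batch. It is not LibAUC's implementation: the function name, argument names, and the unweighted form are assumptions made for illustration, and the library's own code may differ in weighting and other details. In training, `a`, `b`, and `alpha` are learned alongside the model by the optimizer rather than passed in by hand.

```python
def aucm_objective(scores, labels, a, b, alpha, margin=1.0):
    """Illustrative AUC-Margin objective for one batch (assumed form).

    scores: model outputs; labels: 1 for positive, 0 for negative.
    a, b:   auxiliary variables tracking the mean positive/negative score.
    alpha:  non-negative dual variable enforcing the margin.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("batch needs at least one positive and one negative")
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    # Squared spread of each class around its auxiliary variable.
    pos_term = sum((s - a) ** 2 for s in pos) / len(pos)
    neg_term = sum((s - b) ** 2 for s in neg) / len(neg)
    # Margin term: penalizes a gap between class means smaller than `margin`.
    margin_term = 2 * alpha * (margin - mean_pos + mean_neg) - alpha ** 2
    return pos_term + neg_term + margin_term


if __name__ == "__main__":
    # Well-separated scores with alpha = 0 give zero loss.
    print(aucm_objective([1.0, 0.0], [1, 0], a=1.0, b=0.0, alpha=0.0))
```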
The training process has two steps: (1) we train a DeepGCN model from scratch using our AUC-Margin loss; (2) we jointly fine-tune the pretrained model from (1) with the FingerPrints models.
- Train the DeepGCN model with the AUC-Margin loss and PESG optimizer using the default parameters:
python main.py --use_gpu --conv_encode_edge --num_layers 14 --block res+ --gcn_aggr softmax --t 1.0 --learn_t --dropout 0.2 \
--dataset ogbg-molhiv \
--loss auroc \
--optimizer pesg \
--batch_size 512 \
--lr 0.1 \
--gamma 500 \
--margin 1.0 \
--weight_decay 1e-5 \
--random_seed 0 \
--epochs 300
- Extract fingerprints and train a Random Forest, following PaddleHelix:
python extract_fingerprint.py
python random_forest.py
- Fine-tune the pretrained model with the FingerPrints model using the AUC-Margin loss and the default parameters:
python finetune.py --use_gpu --conv_encode_edge --num_layers 14 --block res+ --gcn_aggr softmax --t 1.0 --learn_t --dropout 0.2 \
--dataset ogbg-molhiv \
--loss auroc \
--optimizer pesg \
--batch_size 512 \
--lr 0.01 \
--gamma 300 \
--margin 1.0 \
--weight_decay 1e-5 \
--random_seed 0 \
--epochs 100
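Both commands above take a `--random_seed` flag, so repeating a stage with several seeds can be scripted. The sketch below only builds and prints the stage (1) command for each seed, reusing the flags shown above; the helper name `build_train_command` is hypothetical, and actually launching the runs (for example with `subprocess.run`) is left to the reader.

```python
import shlex

# Flags copied from the stage (1) command above, minus the seed.
BASE_ARGS = [
    "--use_gpu", "--conv_encode_edge",
    "--num_layers", "14", "--block", "res+",
    "--gcn_aggr", "softmax", "--t", "1.0", "--learn_t",
    "--dropout", "0.2",
    "--dataset", "ogbg-molhiv",
    "--loss", "auroc",
    "--optimizer", "pesg",
    "--batch_size", "512",
    "--lr", "0.1",
    "--gamma", "500",
    "--margin", "1.0",
    "--weight_decay", "1e-5",
    "--epochs", "300",
]


def build_train_command(seed, script="main.py"):
    """Return the argument list for one training run with the given seed."""
    return ["python", script, *BASE_ARGS, "--random_seed", str(seed)]


if __name__ == "__main__":
    for seed in range(10):  # seeds 0 to 9
        cmd = build_train_command(seed)
        print(" ".join(shlex.quote(arg) for arg in cmd))
```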
Result (1) improves the original baseline (DeepGCN) from 0.7858 to 0.8159 test AUROC, a gain of about 3 percentage points. Result (2) achieves a higher SOTA performance of 0.8352, about 1 percentage point above the best previous baseline (Neural FingerPrints, 0.8232). For each stage, we train the model 10 times with different random seeds, e.g., 0 to 9.
If you have any questions, please open a new issue in this repo or contact Zhuoning Yuan [yzhuoning@gmail.com]. If you find this work useful, please cite the following paper for our method and library:
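The mean±std entries in the results table summarize runs like these. As a minimal sketch of that aggregation, the snippet below formats a list of per-seed AUROC values; the values in the example are made up for illustration, and the use of the population standard deviation is an assumption, not something this README specifies.

```python
from statistics import mean, pstdev


def format_result(aucs):
    """Format per-seed AUROC values as 'mean±std' with four decimals.

    Uses the population standard deviation (an assumption).
    """
    return f"{mean(aucs):.4f}±{pstdev(aucs):.4f}"


if __name__ == "__main__":
    # Hypothetical per-seed scores, NOT results from this repo.
    example_aucs = [0.80, 0.82]
    print(format_result(example_aucs))
```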
@inproceedings{yuan2021robust,
title={Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification},
author={Yuan, Zhuoning and Yan, Yan and Sonka, Milan and Yang, Tianbao},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2021}
}