Metabolite identification with fused Gromov-Wasserstein

This repository contains a Python implementation of the supervised graph prediction method proposed in [1] for solving the metabolite identification problem.

Metabolite identification problem

An important problem in metabolomics is to identify the small molecules, called metabolites, that are present in a biological sample. Mass spectrometry is a widespread method to extract distinctive features from a biological sample in the form of a tandem mass (MS/MS) spectrum.

Various machine learning methods have been proposed to solve this problem. They learn to predict metabolites from mass spectra using a training set of (mass spectrum, molecular structure) pairs. This task belongs to the challenging family of structured prediction problems.

Dataset. The data used in [1] are available for download from https://zenodo.org/record/804241#.Yi9bzS_pNhE. They consist of 4138 labeled examples, also used in [2] to evaluate the performance of metabolite identification from tandem mass spectra. These data were extracted and processed in [3]. The MS/MS spectra were extracted from the GNPS public spectral library (https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp). The candidate sets were built from molecular structures in PubChem.

Quick start

Load data.

from Utils.load_data import load_dataset_kernel_graph

n_tr = 3000
n_te = 1148
D_tr, D_te = load_dataset_kernel_graph(n_tr)
K_tr, Y_tr = D_tr
K_tr_te, K_te_te, Y_te = D_te

Input pre-processing (optional). Centering and normalizing the input kernel improves the statistical performance.

from Utils.metabolites_utils import center_gram_matrix, normalize_gram_matrix

center, normalize = True, True
if center:
    K_tr_te = center_gram_matrix(K_tr_te, K_tr, K_tr_te, K_tr)
    K_tr = center_gram_matrix(K_tr)
if normalize:
    K_tr_te = normalize_gram_matrix(K_tr_te, K_tr, K_te_te)
    K_tr = normalize_gram_matrix(K_tr)
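
For readers unfamiliar with these operations, here is a minimal NumPy sketch of the standard formulas for centering and normalizing Gram matrices. It is not the code of `Utils.metabolites_utils`: the function names are hypothetical, and the cross kernel is assumed to have shape (n_test, n_train), which may differ from the layout used in this repository.

```python
import numpy as np

def center_train(K):
    """Center a square training Gram matrix in feature space (H K H)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def center_cross(K_cross, K_train):
    """Center a (n_test, n_train) cross kernel using the training mean."""
    row_mean = K_cross.mean(axis=1, keepdims=True)  # <x_test, mean of training points>
    col_mean = K_train.mean(axis=0, keepdims=True)  # <mean of training points, x_train>
    return K_cross - row_mean - col_mean + K_train.mean()

def normalize_train(K):
    """Scale a Gram matrix so that every diagonal entry equals 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def normalize_cross(K_cross, K_train, K_test):
    """Scale a (n_test, n_train) cross kernel by the norms of both sides."""
    d_test = np.sqrt(np.diag(K_test))
    d_train = np.sqrt(np.diag(K_train))
    return K_cross / np.outer(d_test, d_train)

# Toy example with a linear kernel on random features (illustrative data only).
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(6, 3)), rng.normal(size=(4, 3))
K_train = X_train @ X_train.T
K_cross = X_test @ X_train.T
print(center_train(K_train).sum(axis=0).round(8))  # columns of a centered kernel sum to 0
```

With a linear kernel, centering the Gram matrices this way is equivalent to subtracting the training mean from the features before taking inner products.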

Train.

from Methods.method_gromov_wasserstein import FgwEstimator
from Utils.diffusion import diffuse

clf = FgwEstimator()
clf.ground_metric = 'diffuse'
L = 1e-4  # kernel ridge regularization parameter
# The larger tau is, the more similar the features of neighboring atoms become.
# This affects the ground metric of the FGW distance.
clf.tau = 0.6
Y_tr = diffuse(Y_tr, clf.tau)
clf.train(K_tr, Y_tr, L)
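
To illustrate the role of tau, the sketch below diffuses one-hot atom features over a molecular graph with the heat kernel exp(-tau * Laplacian). This is an assumption about what a Laplacian diffusion looks like, not the code of `Utils.diffusion.diffuse`; the function name `diffuse_features` and the toy graph are hypothetical.

```python
import numpy as np

def diffuse_features(adjacency, features, tau):
    """Smooth node features with the heat kernel exp(-tau * L) of the graph.

    adjacency: (n, n) symmetric adjacency matrix of the molecular graph.
    features:  (n, d) node features, e.g. one-hot atom types.
    tau:       diffusion time; tau = 0 leaves the features unchanged.
    """
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    # The Laplacian is symmetric, so its matrix exponential follows
    # from an eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    heat_kernel = (eigvecs * np.exp(-tau * eigvals)) @ eigvecs.T
    return heat_kernel @ np.asarray(features, dtype=float)

# Toy path graph with three atoms and two atom types (illustrative only).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
F = np.array([[1, 0],
              [0, 1],
              [1, 0]])
for tau in (0.0, 0.6, 5.0):
    print(tau, diffuse_features(A, F, tau).round(3))
```

As tau grows, the rows of the diffused feature matrix move toward each other, which matches the comment above: neighboring atoms end up with more similar features.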

Predict and compute the test scores.

n_bary = 5  # Number of kept alpha_i(x) when predicting
n_c_max = 500   # Do not predict test input with more than n_c_max candidates
fgw, topk, n_pred = clf.predict(K_tr_te, n_bary=n_bary, Y_te=Y_te, n_c_max=n_c_max)
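
The weights alpha_i(x) mentioned above come from kernel ridge regression. The sketch below shows one common way to compute them and to keep the n_bary largest. It is written under assumptions: the scaling of the regularization by the number of training points, the selection by signed value, and the function names are all hypothetical and may differ from `FgwEstimator`.

```python
import numpy as np

def krr_weights(K_train, k_x, lam):
    """Kernel ridge weights alpha(x) = (K + n * lam * I)^{-1} k_x.

    K_train: (n, n) training Gram matrix.
    k_x:     (n,) kernel values between the test input x and the training inputs.
    lam:     ridge regularization parameter (assumed to be scaled by n).
    """
    n = K_train.shape[0]
    return np.linalg.solve(K_train + n * lam * np.eye(n), k_x)

def keep_largest(alpha, n_bary):
    """Return the indices and values of the n_bary largest weights."""
    alpha = np.asarray(alpha)
    idx = np.argsort(alpha)[::-1][:n_bary]
    return idx, alpha[idx]

# Toy example: the test input coincides with training point 0, so with a
# small regularization almost all the weight falls on that point.
rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
K_train = B @ B.T + np.eye(6)   # a well-conditioned positive definite matrix
alpha = krr_weights(K_train, K_train[:, 0], lam=1e-8)
idx, weights = keep_largest(alpha, n_bary=2)
print(idx[0], weights.round(4))
```

Truncating to a few weights keeps the barycenter computation cheap, since only n_bary training graphs enter the prediction for each test input.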

You should obtain the following results:

  • Mean FGW = 0.14209778 ± 0.0460629.
  • Top-1 = 32.6%, Top-10 = 62.7%, Top-20 = 72.5%.
  • Number of predictions = 448.
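
For reference, Top-k accuracies like those above can be derived from the rank of the true molecule within each candidate set. The following is a minimal sketch with hypothetical helper names; it assumes that a lower score means a better candidate and that ties are resolved in favor of the true molecule, which may not match the evaluation code in this repository.

```python
import numpy as np

def rank_of_truth(scores, true_index):
    """1-based rank of the true candidate when lower scores are better."""
    scores = np.asarray(scores, dtype=float)
    # Count the candidates that score strictly better than the true one.
    return 1 + int(np.sum(scores < scores[true_index]))

def top_k_accuracy(ranks, ks=(1, 10, 20)):
    """Percentage of test inputs whose true molecule is ranked within the top k."""
    ranks = np.asarray(ranks)
    return {k: 100.0 * float(np.mean(ranks <= k)) for k in ks}

# Toy example with four test inputs (illustrative ranks only).
ranks = [1, 3, 12, 25]
print(top_k_accuracy(ranks))  # {1: 25.0, 10: 50.0, 20: 75.0}
```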

Reproducing the experiments in [1]

Brogat-Motte et al. [1] (Section 6.2) experimentally assess the performance of fused Gromov-Wasserstein (FGW) barycenters for predicting metabolites from mass spectra. In particular, they compare the prediction performance obtained with different ground metrics between atoms in the FGW distance between molecules (Vayer et al. [4]).
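
To make the role of the ground metric concrete, the sketch below evaluates the fused Gromov-Wasserstein objective for a fixed coupling between two graphs. It computes the cost of a given transport plan, not the FGW distance itself, which would additionally minimize over couplings. The function name is hypothetical, and the convention that alpha weights the structure term is an assumption.

```python
import numpy as np

def fgw_cost(M, C1, C2, T, alpha=0.5):
    """FGW objective for a fixed coupling T.

    M:  (n, m) ground metric between atom features of the two graphs.
    C1: (n, n) structure matrix of graph 1 (e.g. adjacency or shortest paths).
    C2: (m, m) structure matrix of graph 2.
    T:  (n, m) coupling with nonnegative entries.
    Returns (1 - alpha) * sum_ij M_ij T_ij
            + alpha * sum_ijkl (C1_ik - C2_jl)^2 T_ij T_kl.
    """
    p, q = T.sum(axis=1), T.sum(axis=0)  # marginals of the coupling
    feature_term = np.sum(M * T)
    # Expanding the square avoids building the four-dimensional tensor.
    structure_term = (p @ (C1 ** 2) @ p + q @ (C2 ** 2) @ q
                      - 2.0 * np.sum(T * (C1 @ T @ C2.T)))
    return (1.0 - alpha) * feature_term + alpha * structure_term

# Toy example: a graph compared with itself under the identity coupling.
C = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])          # one-hot atom types
M = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)      # squared Euclidean ground metric
T = np.eye(3) / 3.0
print(abs(fgw_cost(M, C, C, T)) < 1e-12)  # True: identical graphs cost nothing
```

Changing how the atom features F are built (one-hot, fine-grained, or diffused) changes M, and therefore which molecules the distance considers close.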

These experiments can be reproduced in two steps: 1) hyperparameter selection, 2) testing the methods with the selected hyperparameters. Step 2) can be run directly, without running step 1) first.

1) Run hyperparameter selection.

python hyper_param_selection.py method

where method is one of "finger", "gk", "gw_onehot", "gw_fine", "gw_diffuse".

The results are saved in the "Results" folder, which already contains saved results.

2) Test the methods with the selected hyperparameters.

python test.py method

The results are saved in the "Results" folder, which already contains saved results.
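
To run both steps for every method in one go, a small driver along these lines may be convenient. It is a sketch, not part of the repository: the helper names are hypothetical, and it assumes that test.py accepts the same method names as hyper_param_selection.py and that both scripts are launched from the repository root.

```python
import subprocess
import sys

METHODS = ("finger", "gk", "gw_onehot", "gw_fine", "gw_diffuse")

def build_commands(methods=METHODS, select_hyperparameters=True):
    """Build the command lines for both steps, one method after the other."""
    commands = []
    for method in methods:
        if select_hyperparameters:
            commands.append([sys.executable, "hyper_param_selection.py", method])
        commands.append([sys.executable, "test.py", method])
    return commands

def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)

# Dry run: print the commands instead of executing them.
for command in build_commands(select_hyperparameters=False):
    print(" ".join(command))
```

Calling `run_all(build_commands())` from the repository root would execute the full pipeline; passing `select_hyperparameters=False` skips step 1).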

The Top-k accuracies obtained on the test data are given in the following table.

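| Method               | Top-1 | Top-10 | Top-20 |
|----------------------|-------|--------|--------|
| WL kernel            | 9.8%  | 29.1%  | 37.4%  |
| Linear fingerprint   | 28.6% | 54.5%  | 59.9%  |
| Gaussian fingerprint | 41.0% | 62.0%  | 67.8%  |
| FGW one-hot          | 12.7% | 37.3%  | 44.2%  |
| FGW fine             | 18.1% | 46.3%  | 53.7%  |
| FGW diffuse          | 27.8% | 52.8%  | 59.6%  |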

References

[1] Brogat-Motte, L., Flamary, R., Brouard, C., Rousu, J., d'Alché-Buc, F. Learning to Predict Graphs with Fused Gromov-Wasserstein Barycenters. arXiv preprint arXiv:2202.03813, 2022. (http://arxiv.org/abs/2202.03813)

[2] Brouard, C., Shen, H., Dührkop, K., d'Alché-Buc, F., Böcker, S., Rousu, J. Fast metabolite identification with Input Output Kernel Regression. In Proceedings of ISMB 2016, Bioinformatics, 32(12): i28-i36, 2016. (https://doi.org/10.1093/bioinformatics/btw246)

[3] Dührkop, K., Shen, H., Meusel, M., Rousu, J., Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. PNAS, 112(41): 12580-12585, 2015. (https://doi.org/10.1073/pnas.1509788112)

[4] Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Optimal transport for structured data with application on graphs. In International Conference on Machine Learning (ICML), 2019.