nataliarosa9 committed Jan 3, 2024
1 parent d676f1e commit 7b9b3fd
Showing 1 changed file with 107 additions and 55 deletions.
162 changes: 107 additions & 55 deletions templates/about.html
@@ -15,6 +15,7 @@

}


</script>
{% endblock%}

@@ -27,76 +28,119 @@
<div class="col-md-12 text-center text-md-start fs-1 mb-5">
<h1 class="fw-bold mb-4">About - Methods</h1>
<p class="text-justify fs-0">
We developed PRECOGx, a machine learning predictor of GPCR interactions with G-protein and β-arrestin, by using the ESM1b protein embeddings as features and experimental binding datasets.
We developed PRECOGx, a machine learning predictor of GPCR interactions with G-protein and
β-arrestin, by using the ESM1b protein embeddings as features and experimental binding datasets.
</p>
<h2 class="fw-bold mb-4">Embeddings generation</h2>

<p class="text-justify fs-0">Embeddings of the protein sequences were generated by using pre-trained protein language models that have
been recently released. We computed embeddings from fasta sequence using the extract.py function of the <a href= "https://github.com/facebookresearch/esm">ESM library </a>
and by specifying the ESM-1b model (esm1b_t33_650M_UR50S) with embedding for individual amino acids as well as averaged over the full
<p class="text-justify fs-0">Embeddings of the protein sequences were generated by using pre-trained
protein language models that have
been recently released. We computed embeddings from fasta sequence using the extract.py function of
the <a href="https://github.com/facebookresearch/esm">ESM library </a>
and by specifying the ESM-1b model (esm1b_t33_650M_UR50S) with embedding for individual amino acids
as well as averaged over the full
sequence using the option <i>“--include mean per_tok”</i>.
</p>
<p class="text-justify fs-0">We generated embeddings for each individual layer separately, including the final one, by specifying its corresponding
<p class="text-justify fs-0">We generated embeddings for each individual layer separately, including
the final one, by specifying its corresponding
number in the <i>“--repr_layers”</i> option.
</p>
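A minimal sketch of how the extraction call described above could be assembled, assuming the ESM repository is checked out locally so that the script lives at `scripts/extract.py`. The file names `gpcrs.fasta` and `gpcr_embeddings` are hypothetical, and the layer flag is spelled `--repr_layers` here, which is an assumption worth checking against the script's `--help` output.

```python
from typing import Iterable, List


def build_extract_command(
    fasta_path: str,
    output_dir: str,
    layers: Iterable[int],
    model: str = "esm1b_t33_650M_UR50S",
    script: str = "scripts/extract.py",  # assumed location inside the ESM checkout
) -> List[str]:
    """Assemble the argument list for the ESM extraction script."""
    cmd = ["python", script, model, fasta_path, output_dir]
    # One representation is requested per listed layer.
    cmd += ["--repr_layers"] + [str(layer) for layer in layers]
    # "mean" averages over the sequence, "per_tok" keeps one vector per residue.
    cmd += ["--include", "mean", "per_tok"]
    return cmd


# ESM-1b has 33 transformer layers; index 0 is the input embedding layer.
command = build_extract_command("gpcrs.fasta", "gpcr_embeddings", range(0, 34))
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the paths point at real files.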
<h2>Data sets</h2>
<p class="text-justify fs-0">
We obtained experimental binding affinities from two distinct sources: TGF assay(12), which captures the binding
affinities of 148 GPCRs with 11 chimeric G-proteins, and the ebBRET assay, which profiles the binding affinities of 97
GPCRs with 12 G-proteins and 3 β-arrestins/GRKs binders, available at <a href="https://gpcrdb.org/">gpcrdb</a>. We also used an integrated
meta-coupling dataset derived from a meta-analysis of the aforementioned assays, entailing binding affinities of 164 GPCRs
for 14 G-proteins. For the TGF assay, we considered a receptor coupled to a G-protein if the logarithm (base 10)
of the relative intrinsic activity (logRAi) was greater than -1, and not-coupled otherwise. Similarly, for the GEMTA assay,
we considered a receptor coupled to a G-protein (or β-arrestins/GRK) if the binding efficacy (dnorm Emax) was greater than 0,
and not-coupled otherwise. For the integrated meta-coupling dataset,
we considered a receptor coupled to a G-protein if the integrated binding affinity was greater than 0, and not-coupled otherwise.
<p class="text-justify fs-0">
We obtained experimental binding affinities from two distinct sources: TGF assay(12), which captures
the binding
affinities of 148 GPCRs with 11 chimeric G-proteins, and the ebBRET assay, which profiles the
binding affinities of 97
GPCRs with 12 G-proteins and 3 β-arrestins/GRKs binders, available at <a href="https://gpcrdb.org/">gpcrdb</a>.
We also used an integrated
meta-coupling dataset derived from a meta-analysis of the aforementioned assays, entailing binding
affinities of 164 GPCRs
for 14 G-proteins. For the TGF assay, we considered a receptor coupled to a G-protein if the
logarithm (base 10)
of the relative intrinsic activity (logRAi) was greater than -1, and not-coupled otherwise.
Similarly, for the GEMTA assay,
we considered a receptor coupled to a G-protein (or β-arrestins/GRK) if the binding efficacy (dnorm
Emax) was greater than 0,
and not-coupled otherwise. For the integrated meta-coupling dataset,
we considered a receptor coupled to a G-protein if the integrated binding affinity was greater than
0, and not-coupled otherwise.
</p>
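The binarization rules above can be sketched as follows. The thresholds come from the text; the dataset keys (`tgf`, `gemta`, `meta`) and the receptor names are hypothetical placeholders, not identifiers from the PRECOGx code.

```python
from typing import Dict

# Thresholds as stated in the text: a value must be strictly greater to count as coupled.
THRESHOLDS: Dict[str, float] = {
    "tgf": -1.0,   # logRAi
    "gemta": 0.0,  # dnorm Emax
    "meta": 0.0,   # integrated binding affinity
}


def is_coupled(value: float, dataset: str) -> bool:
    """Return True when the measured value exceeds the dataset threshold."""
    return value > THRESHOLDS[dataset]


def label_receptors(values: Dict[str, float], dataset: str) -> Dict[str, int]:
    """Map each receptor to 1 (coupled) or 0 (not-coupled)."""
    return {name: int(is_coupled(v, dataset)) for name, v in values.items()}


# Hypothetical logRAi values for one G-protein in the TGF assay.
example = {"RECEPTOR_A": -0.4, "RECEPTOR_B": -1.0, "RECEPTOR_C": -2.3}
labels = label_receptors(example, "tgf")
print(labels)
```

Note that a value exactly at the threshold (such as `RECEPTOR_B`) is labeled not-coupled, because the text specifies "greater than".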

<img class="pt-md-0 center" src="static/img/gallery/workflow.png" alt="Method workflow"/>
<h2 class="fw-bold mb-4">Model training</h2>
<p class="text-justify fs-0">We developed the new PRECOGx by training multiple models using the protein embeddings
derived from the pre-trained ESM-1b model as features. For every pair of a coupling group (G-protein/β-arrestins)
and assay dataset (TGF/GEMTA assays), we created a training matrix with vectors, each containing the decomposed
PCA values of a receptor embedding along with the binary label (coupled/not-coupled) as the last element. We implemented
the predictor using either a logistic regression or support vector classifier from the <a href="https://scikit-learn.org/">scikit-learn</a> library.
<p class="text-justify fs-0">We developed the new PRECOGx by training multiple models using the protein
embeddings
derived from the pre-trained ESM-1b model as features. For every pair of a coupling group
(G-protein/β-arrestins)
and assay dataset (TGF/GEMTA assays), we created a training matrix with vectors, each containing the
decomposed
PCA values of a receptor embedding along with the binary label (coupled/not-coupled) as the last
element. We implemented
the predictor using either a logistic regression or support vector classifier from the <a
href="https://scikit-learn.org/">scikit-learn</a> library.
A grid search was performed using
a stratified 5-fold cross validation (CV) to select the best hyperparameters of the classifier.
We repeated the process 10 times to ensure a minimum variance. We generated a total of 34 models per G-protein (or β-arrestin) and assay.
The best models were chosen based on the highest AUC (Area Under the Curve) score during the 5-fold cross-validation.
We repeated the process 10 times to ensure a minimum variance. We generated a total of 34 models per
G-protein (or β-arrestin) and assay.
The best models were chosen based on the highest AUC (Area Under the Curve) score during the 5-fold
cross-validation.
</p>
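A minimal sketch of the kind of grid search described above, using scikit-learn with a PCA step followed by logistic regression and a stratified 5-fold CV scored by AUC. The data are synthetic stand-ins generated with `make_classification`, and the hyperparameter grid is an illustrative assumption, not the grid used for PRECOGx.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for receptor embeddings (rows) and coupled/not-coupled labels.
X, y = make_classification(
    n_samples=120, n_features=64, n_informative=8, random_state=0
)

# PCA decomposition followed by the classifier, tuned together.
pipeline = Pipeline(
    [
        ("pca", PCA(random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Illustrative grid only; the actual search space is not given in the text.
param_grid = {"pca__n_components": [5, 10], "clf__C": [0.1, 1.0, 10.0]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Repeating the procedure with different random seeds for the fold split, as the text describes for the 10 repetitions, would give a spread of AUC values rather than a single estimate.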
<h2 class="fw-bold mb-4">Model testing</h2>
<p class="text-justify fs-0">We benchmarked our method against PRECOG, the web-server for GPCR/G-protein coupling predictions that we
previously developed. We obtained an independent list of 117 (<strong>TGF assay</strong> data as the training set),
and 160 receptors (<strong>GEMTA assay</strong> as the training set) from the GtoPdb that are absent in both the assay datasets.
Since <a href="http://www.guidetopharmacology.org/">GtoPdb</a> lacks a proper true negative set, we used Recall (REC) as a measure
<p class="text-justify fs-0">We benchmarked our method against PRECOG, the web-server for GPCR/G-protein
coupling predictions that we
previously developed. We obtained an independent list of 117 (<strong>TGF assay</strong> data as the
training set),
and 160 receptors (<strong>GEMTA assay</strong> as the training set) from the GtoPdb that are absent
in both the assay datasets.
Since <a href="http://www.guidetopharmacology.org/">GtoPdb</a> lacks a proper true negative set, we
used Recall (REC) as a measure
to compare the performance of PRECOGx with PRECOG. To assess over-fitting, we performed
the <a href="https://link.springer.com/article/10.1023/A:1009752403260">randomization test</a> by randomly shuffling the original
labels of the training matrix, while preserving the ratio of the number of coupled to not-coupled receptors.
the <a href="https://link.springer.com/article/10.1023/A:1009752403260">randomization test</a> by
randomly shuffling the original
labels of the training matrix, while preserving the ratio of the number of coupled to not-coupled
receptors.
</p>
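The two evaluation ideas above can be sketched with the standard library alone: recall computed from true and predicted labels, and a label shuffle that keeps the coupled to not-coupled ratio unchanged because it only permutes the existing labels. The function names and example labels are hypothetical.

```python
import random
from typing import List, Sequence


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of truly coupled receptors that are predicted as coupled."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def shuffle_labels(labels: Sequence[int], seed: int) -> List[int]:
    """Permute the labels; the class counts, and hence the ratio, are preserved."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical training labels: 6 coupled and 14 not-coupled receptors.
labels = [1] * 6 + [0] * 14
shuffled = shuffle_labels(labels, seed=0)

example_recall = recall([1, 1, 1, 0], [1, 0, 1, 1])
print(sum(shuffled), len(shuffled), round(example_recall, 3))
```

A model retrained on the shuffled labels is expected to perform close to chance; a large gap between its score and the score on the original labels suggests the original model is not simply over-fitting.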
<h2 class="fw-bold mb-4">PCA of the GPCRome embedded space</h2>
<p class="text-justify fs-0">We generated embeddings for the human GPCRome, comprising a total of 377
receptors (279 Class A, 15 Class B1, 17 Class B2, 17 class C, 11 class F, 25 Taste receptors and 14 in other classes).
We considered the embeddings generated from all the layers. Embeddings were subjected to
Principal Component Analysis (PCA), using the PCA function from the scikit-learn library. Each GPCR sequence within
the embedding is annotated with functional information: (i) coupling specificities (known from the TGF assay, GEMTA assay,
the GtoPdb, and, for β-arrestins, the STRING databases); (ii) GPCR class membership (known from the GtoPdb).
receptors (279 Class A, 15 Class B1, 17 Class B2, 17 class C, 11 class F, 25 Taste receptors and 14
in other classes).
We considered the embeddings generated from all the layers. Embeddings were
subjected to
Principal Component Analysis (PCA), using the PCA function from the scikit-learn library. Each GPCR
sequence within
the embedding is annotated with functional information: (i) coupling specificities (known from the
TGF assay, GEMTA assay,
the GtoPdb, and, for β-arrestins, the STRING databases); (ii) GPCR class membership (known from the
GtoPdb).
</p>
<h2 class="fw-bold mb-4">Contact analysis</h2>
<p class="text-justify fs-0">To interpret the determinants of binding specificity, we first calculated predicted contacts
for each sequence using a logistic regression over the model's attention maps, available in the ESM library through
the predict_contacts function. Then, the predicted contact maps were grouped on the basis of G-protein binding specificity
<p class="text-justify fs-0">To interpret the determinants of binding specificity, we first calculated
predicted contacts
for each sequence using a logistic regression over the model's attention maps, available in the ESM
library through
the predict_contacts function. Then, the predicted contact maps were grouped on the basis of
G-protein binding specificity
and contrasted to the contact maps of all the GPCRs which was used as a
background. We computed a differential contact map by calculating a log-odds ratio, employing the following formula:
background. We computed a differential contact map by calculating a log-odds ratio, employing the
following formula:
</p>
<img class="pt-md-0 center" src="static/img/gallery/formula.png" alt="Math formula"/>
<p class="text-justify fs-0">Where the AA and BB terms represent the number of GPCRs coupled to a specific G-protein, depending on the assay,
that have or do not have a specific contact pair, respectively. The CC and DD terms represent the number of not-coupled GPCRs for a
specific G-protein, depending on the assay, that have or do not have a specific contact pair, respectively. Contacts contributed
by the loops, N-terminus, and C-terminus of the GPCR were aggregated. The contact pairs considered are those with a probability higher
than 0.5, as calculated by the predict_contacts function, and appearing in at least 15% of all GPCRs in the assays. We computed the log-odds
ratio using the Table2x2 function from <a href="https://www.statsmodels.org/">StatsModels</a>. The resulting log-odds ratio was normalized using the MaxAbsScaler
from scikit-learn. Contacts with a positive log-odds ratio (enriched) are seen more frequently in receptors coupled to a specific G-protein,
while contacts with a negative log-odds ratio (depleted) are seen less frequently in receptors coupled to a specific G-protein.
<p class="text-justify fs-0">Where the AA and BB terms represent the number of GPCRs coupled to a specific
G-protein, depending on the assay,
that have or do not have a specific contact pair, respectively. The CC and DD terms represent the number
of not-coupled GPCRs for a
specific G-protein, depending on the assay, that have or do not have a specific contact pair,
respectively. Contacts contributed
by the loops, N-terminus, and C-terminus of the GPCR were aggregated. The contact pairs considered are
those with a probability higher
than 0.5, as calculated by the predict_contacts function, and appearing in at least 15% of all
GPCRs in the assays. We computed the log-odds
ratio using the Table2x2 function from <a href="https://www.statsmodels.org/">StatsModels</a>. The
resulting log-odds ratio was normalized using the MaxAbsScaler
from scikit-learn. Contacts with a positive log-odds ratio (enriched) are seen more frequently in
receptors coupled to a specific G-protein,
while contacts with a negative log-odds ratio (depleted) are seen less frequently in receptors
coupled to a specific G-protein.
</p>
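A minimal standard-library sketch of the log-odds computation and scaling described above. It assumes the usual 2x2 orientation, log((a·d)/(b·c)), with a and b the coupled receptors with and without the contact and c and d the not-coupled ones; the formula on the page is an image, so this orientation is an assumption. Adding 0.5 to every cell when a zero is present is also an assumption made here to keep the ratio finite. The contact pair names and counts are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def log_odds_ratio(a: float, b: float, c: float, d: float,
                   pseudocount: float = 0.5) -> float:
    """Log-odds ratio of a 2x2 table [[a, b], [c, d]]."""
    if 0 in (a, b, c, d):
        # Assumed zero handling: shift every cell so the ratio stays finite.
        a, b, c, d = (x + pseudocount for x in (a, b, c, d))
    return math.log((a * d) / (b * c))


def max_abs_scale(values: Sequence[float]) -> List[float]:
    """Divide by the largest absolute value so results fall in [-1, 1]."""
    largest = max((abs(v) for v in values), default=0.0)
    return [v / largest if largest else 0.0 for v in values]


# Hypothetical counts per contact pair:
# (coupled with, coupled without, not-coupled with, not-coupled without)
tables: Dict[str, Tuple[int, int, int, int]] = {
    "pair_1": (12, 8, 5, 20),
    "pair_2": (4, 16, 10, 15),
}

raw = [log_odds_ratio(*counts) for counts in tables.values()]
scaled = max_abs_scale(raw)
for name, value in zip(tables, scaled):
    print(name, round(value, 3))
```

With these counts, `pair_1` is enriched among coupled receptors (positive value) and `pair_2` is depleted (negative value); scaling by the maximum absolute value keeps the signs and makes the two comparable on a common range.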
<img class="pt-md-0 center" src="static/img/gallery/contact.png" alt="contact analysis"/>

@@ -110,17 +154,25 @@ <h2 class="fw-bold mb-4">Attention maps</h2>
<h2 class="fw-bold mb-4">Libraries used</h2>
<p class="text-justify fs-0">The following libraries were used to build the webserver:</p>
<ul>
<li class="text-justify fs-0"> ESM </li>
<li class="text-justify fs-0"> NGL viewer </li>
<li class="text-justify fs-0"> jQuery </li>
<li class="text-justify fs-0"> neXtProt </li>
<li class="text-justify fs-0"> Bootstrap </li>
<li class="text-justify fs-0"> Flask </li>
<li class="text-justify fs-0"> Scikit-learn </li>
<li class="text-justify fs-0"> DataTables </li>
<li class="text-justify fs-0"> Plotly </li>
<li class="text-justify fs-0"> ESM</li>
<li class="text-justify fs-0"> NGL viewer</li>
<li class="text-justify fs-0"> jQuery</li>
<li class="text-justify fs-0"> neXtProt</li>
<li class="text-justify fs-0"> Bootstrap</li>
<li class="text-justify fs-0"> Flask</li>
<li class="text-justify fs-0"> Scikit-learn</li>
<li class="text-justify fs-0"> DataTables</li>
<li class="text-justify fs-0"> Plotly</li>
</ul>
<h5 class="fw-bold mb-4">Contact:</h5>
<h2 class="fw-bold">Cite</h2>
<p class="text-justify fs-0">
Marin Matic, Gurdeep Singh, Francesco Carli, Natalia De Oliveira Rosa, Pasquale Miglionico, Lorenzo
Magni, J Silvio Gutkind, Robert B Russell, Asuka Inoue, Francesco Raimondi, PRECOGx: exploring GPCR
signaling mechanisms with deep protein representations, Nucleic Acids Research, Volume 50, Issue W1,
5 July 2022, Pages W598–W610,
<a href="https://doi.org/10.1093/nar/gkac426"
target="_blank">https://doi.org/10.1093/nar/gkac426</a></p>
<h5 class="fw-bold mb-3">Contact</h5>
<p class="text-justify fs-0">Francesco Raimondi - francesco.raimondi@sns.it</p>
<p class="text-justify fs-0">Marin Matic - marin.matic@sns.it</p>
<p class="text-justify fs-0">Gurdeep Singh - gurdeep.singh@bioquant.uni-heidelberg.de</p>
