This repository contains the data and code used in the review of protein sequence embeddings entitled "Transfer learning in proteomics: comparison of novel learned representations for protein sequences," by E. Fenoy, A. Edera and G. Stegmayer (under review). Research Institute for Signals, Systems and Computational Intelligence, sinc(i).
In the figure above, points depict 2D non-linear projections calculated from the 12 protein sequence embeddings studied. Orange points highlight protein sequences having the Immunoglobulin C1-set domain (PF07654).
The figures above show the performance of the 12 embeddings when used to predict the GO terms that annotate protein sequences. Performance is measured with the F1 score, and predictions are grouped according to the three GO sub-ontologies: Biological Process (BP), Cellular Component (CC) and Molecular Function (MF).
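As a rough illustration of the metric, the sketch below computes an F1 score from true and predicted GO term sets and averages it within each sub-ontology. The function names, the per-protein averaging and the toy annotations are assumptions made for this example; the review may aggregate predictions differently.

```python
from collections import defaultdict


def f1_score(true_terms, predicted_terms):
    """F1 between two sets of GO terms (harmonic mean of precision and recall)."""
    true_terms, predicted_terms = set(true_terms), set(predicted_terms)
    true_positives = len(true_terms & predicted_terms)
    if true_positives == 0:
        # Covers empty sets and predictions with no overlap.
        return 0.0
    precision = true_positives / len(predicted_terms)
    recall = true_positives / len(true_terms)
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_ontology(records):
    """Average per-protein F1 within each sub-ontology (BP, CC, MF).

    `records` is an iterable of (ontology, true_terms, predicted_terms).
    """
    scores = defaultdict(list)
    for ontology, true_terms, predicted_terms in records:
        scores[ontology].append(f1_score(true_terms, predicted_terms))
    return {ont: sum(vals) / len(vals) for ont, vals in scores.items()}


# Toy annotations for illustration only (not data from the review).
records = [
    ("BP", {"GO:0006955", "GO:0002376"}, {"GO:0006955"}),
    ("MF", {"GO:0005515"}, {"GO:0005515"}),
    ("CC", {"GO:0005886"}, {"GO:0005737"}),
]
result = mean_f1_by_ontology(records)
```

Here the BP prediction recovers one of two true terms (precision 1, recall 0.5), the MF prediction is exact, and the CC prediction misses entirely, so the three groups receive different scores.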
Representation learning techniques have recently been proposed for encoding different types of protein information (sequence, domains, interactions, etc.) as low-dimensional vectors. In this review, we performed a detailed experimental comparison of several protein sequence embeddings on three bioinformatics tasks:
- determining similarities between proteins in the projected embedding space.
- inferring protein domains.
- predicting GO-based protein functions.
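To make the first task concrete, here is a minimal sketch of comparing proteins by the similarity of their embedding vectors. Cosine similarity, the helper names and the three-dimensional toy vectors are assumptions chosen for illustration; the review may rely on a different similarity measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def nearest_protein(query_id, embeddings):
    """Return the id of the protein most similar to `query_id`.

    `embeddings` maps a protein id to its embedding vector.
    """
    query = embeddings[query_id]
    candidates = (pid for pid in embeddings if pid != query_id)
    return max(candidates, key=lambda pid: cosine_similarity(query, embeddings[pid]))


# Hypothetical protein ids and toy vectors (real embeddings have 50 to 8,192 dimensions).
toy_embeddings = {
    "protA": [1.0, 0.0, 0.0],
    "protB": [0.9, 0.1, 0.0],
    "protC": [0.0, 1.0, 0.0],
}
closest = nearest_protein("protA", toy_embeddings)
```

With these toy vectors, `protB` points in almost the same direction as `protA`, so it is returned as the nearest neighbour.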
This notebook reproduces the visual comparison of the 12 embeddings, which evaluates how well protein sequence embeddings capture protein domain information.
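The kind of projection behind this visual analysis can be sketched as follows. This is not the notebook's code: t-SNE from scikit-learn is assumed as the non-linear projection method, the embedding matrix is synthetic random data, and the `has_domain` mask is a hypothetical stand-in for the proteins annotated with a given Pfam domain.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for an embedding matrix (n_proteins x dimensionality).
n_proteins, dim = 60, 64
embeddings = rng.normal(size=(n_proteins, dim))

# Hypothetical mask marking the proteins that carry the domain of interest.
has_domain = np.zeros(n_proteins, dtype=bool)
has_domain[:10] = True
embeddings[has_domain] += 3.0  # shift the toy group so it forms a cluster

# Non-linear 2D projection (assumed method: t-SNE).
projection = TSNE(
    n_components=2, perplexity=15, init="pca", random_state=0
).fit_transform(embeddings)

# Split the projected points into the two groups to be drawn in different colours.
domain_points = projection[has_domain]
other_points = projection[~has_domain]
```

The two arrays can then be passed to a scatter plot, with `domain_points` drawn in a highlight colour over `other_points`.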
The review used 9,479 human protein sequences to build embeddings with 12 embedding methods.
Note: Click the method name below to download the embeddings used in this review.
| Embedding | Dimensionality | Reference |
|---|---|---|
| | 512 | |
| | 8,192 | |
| | 1,280 | |
| | 64 | |
| | 1,024 | |
| | 1,024 | |
| | 300 | |
| | 50 | |
| | 100 | |
| | 1,024 | |
| | 768 | |
| | 1,900 | |