Transfer learning in proteomics: comparison of protein sequence embeddings

This repository contains the data and code used in the review of protein sequence embeddings entitled "Transfer learning in proteomics: comparison of novel learned representations for protein sequences," by E. Fenoy, A. Edera and G. Stegmayer (under review). Research Institute for Signals, Systems and Computational Intelligence, sinc(i).

In the figure above, points depict 2D non-linear projections calculated from the 12 protein sequence embeddings studied. Orange points highlight protein sequences having the Immunoglobulin C1-set domain (PF07654).

The figures above show the performance of the 12 embeddings used for predicting the GO terms annotating protein sequences. Performance is measured with the F1 score and predictions are grouped according to the three sub-ontologies of the GO terms: Biological Process (BP), Cellular Component (CC) and Molecular Function (MF).
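As a rough illustration of the metric described above, the following minimal sketch computes a per-protein F1 score from sets of GO terms and averages it separately for BP, CC and MF. The function names, the per-protein averaging scheme, and the toy protein identifiers are assumptions made for this example; the review may compute and average F1 differently.

```python
from collections import defaultdict


def f1_score(true_terms, predicted_terms):
    """F1 between the true and predicted GO term sets of one protein."""
    true_terms, predicted_terms = set(true_terms), set(predicted_terms)
    true_positives = len(true_terms & predicted_terms)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_terms)
    recall = true_positives / len(true_terms)
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_subontology(annotations, predictions, term_to_subontology):
    """Average per-protein F1 separately for each sub-ontology.

    annotations, predictions: dict mapping protein id -> iterable of GO terms.
    term_to_subontology: dict mapping GO term -> "BP", "CC" or "MF".
    Averaging per protein is an assumption of this sketch.
    """
    scores = defaultdict(list)
    for protein, true_terms in annotations.items():
        predicted = predictions.get(protein, ())
        for sub in ("BP", "CC", "MF"):
            true_sub = {t for t in true_terms if term_to_subontology.get(t) == sub}
            pred_sub = {t for t in predicted if term_to_subontology.get(t) == sub}
            if true_sub:  # skip proteins with no true annotation in this sub-ontology
                scores[sub].append(f1_score(true_sub, pred_sub))
    return {sub: sum(values) / len(values) for sub, values in scores.items()}


# Toy data: "P1" and "P2" are hypothetical protein ids, not taken from the dataset.
term_to_subontology = {
    "GO:0008150": "BP",  # biological_process (root term)
    "GO:0006955": "BP",  # immune response
    "GO:0005634": "CC",  # nucleus
    "GO:0005515": "MF",  # protein binding
}
annotations = {"P1": ["GO:0006955", "GO:0005634"], "P2": ["GO:0005515"]}
predictions = {"P1": ["GO:0006955", "GO:0008150", "GO:0005634"], "P2": []}
result = mean_f1_by_subontology(annotations, predictions, term_to_subontology)
```

Splitting the term sets by sub-ontology before scoring keeps a method that does well on one sub-ontology from masking poor performance on another.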

Introduction

Representation learning techniques have recently been proposed for encoding different types of protein information (sequence, domains, interactions, etc.) as low-dimensional vectors. In this review, we performed a detailed experimental comparison of several protein sequence embeddings on three bioinformatics tasks:

determining similarities between proteins in the projected embedding space.
inferring protein domains.
predicting Gene Ontology (GO)-based protein functions.
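To make the first task more concrete, here is a minimal sketch of how similarity between proteins could be measured once each protein is represented as a vector. Cosine similarity, the helper names, and the toy vectors are assumptions chosen for illustration; the review does not state here which similarity measure it uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention used in this sketch for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, k=3):
    """Return the k proteins closest to query_id, most similar first.

    embeddings: dict mapping protein id -> list of floats.
    """
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2D vectors with hypothetical ids; real embeddings have 50 to 8,192 dimensions.
toy_embeddings = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
neighbours = most_similar("A", toy_embeddings, k=2)
```

The same ranking could also serve as a starting point for the domain task, for example by checking whether the nearest neighbours of a protein share its Pfam domain.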

Notebook

This notebook reproduces the visual comparative analysis of the 12 embeddings, which evaluates how well protein sequence embeddings capture protein domain information.

Protein sequence embeddings

The review used 9,479 human protein sequences to build embeddings with 12 embedding methods.

Note: Click the method name below to download the embeddings used in this review.

Embedding	Dimensionality	Reference
CPCProt	512	Lu et al., 2020
DeepGOCNN	8,192	Kulmanov & Hoehndorf, 2019
ESM	1,280	Rives et al., 2021
GP	64	Yang et al., 2018
Plus-RNN	1,024	Min et al., 2021
ProtTrans	1,024	Elnaggar et al., 2021
ProtVec	300	Asgari & Mofrad, 2015
rawMSA	50	Mirabello & Wallner, 2019
RBM	100	Tubiana et al., 2019
SeqVec	1,024	Heinzinger et al., 2019
TAPE	768	Rao et al., 2019
UniRep	1,900	Alley et al., 2019
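After downloading an embedding file, it can be useful to confirm that each vector has the dimensionality listed in the table. The sketch below encodes the table as a dictionary and checks a set of vectors against it. The in-memory layout (a dictionary from protein id to a list of floats) and the function name are assumptions; the actual file format of the downloads is not described here.

```python
# Dimensionality of each embedding, copied from the table above.
EXPECTED_DIM = {
    "CPCProt": 512,
    "DeepGOCNN": 8192,
    "ESM": 1280,
    "GP": 64,
    "Plus-RNN": 1024,
    "ProtTrans": 1024,
    "ProtVec": 300,
    "rawMSA": 50,
    "RBM": 100,
    "SeqVec": 1024,
    "TAPE": 768,
    "UniRep": 1900,
}


def check_embeddings(method, vectors):
    """Verify that every vector matches the expected size for `method`.

    vectors: dict mapping protein id -> sequence of floats (assumed layout).
    Returns the number of vectors checked, or raises ValueError on a mismatch.
    """
    expected = EXPECTED_DIM[method]
    mismatched = [pid for pid, vec in vectors.items() if len(vec) != expected]
    if mismatched:
        raise ValueError(
            f"{len(mismatched)} {method} vector(s) do not have {expected} dimensions"
        )
    return len(vectors)


# Usage with a hypothetical protein id and a placeholder vector.
checked = check_embeddings("GP", {"P1": [0.0] * 64})
```

For the full dataset, the returned count would be expected to equal the 9,479 human protein sequences used in the review.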
