sinc-lab/lncRNA-folding

lncRNA-folding

This repository contains the source code and data for reproducibility of:

Secondary structure prediction of long non-coding RNA: review and experimental comparison of existing approaches, L.A. Bugnon, A.A. Edera, S. Prochetto, M. Gerard, J. Raad, E. Fenoy, M. Rubiolo, U. Chorostecki, T. Gabaldón, F. Ariel, L. Di Persia, D.H. Milone & G. Stegmayer, Briefings in Bioinformatics, 2022.

Abstract

In contrast to messenger RNAs, the function of long non-coding RNAs (lncRNAs) largely depends on their structure, which determines interactions with other molecules. During the last 20 years, classical approaches for predicting RNA secondary structure have been based on dynamic programming and thermodynamic calculations. In the last 4 years, a growing number of machine learning-based models, including deep learning, have achieved breakthrough performance in structure prediction of biomolecules such as proteins and have outperformed classical methods in small RNA folding.

Nevertheless, accurate prediction of intricate lncRNA structures is still challenging. The aim of this repository is to serve as a public benchmark, with a unified and consistent experimental setup based on curated structures and probing data of lncRNAs. The repository includes:

  • 19 classical and recent methods for RNA structure prediction.
  • 2 curated datasets of RNA sequences whose reference structures have been experimentally validated by chemical probing methods.
  • 2 metrics for comparing predictions with reference structures represented as biochemical probing scores and dot-bracket notation.

The Mean Average Similarity (MAS) is a novel metric proposed in this study. Unlike classic metrics, MAS assesses predictive performance using probing scores, which leads to a more sensitive measurement and allows for comparative analyses that are not biased by the method used to obtain the reference structures. This score was tested in several cases, which are provided as an interactive notebook.
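For the dot-bracket side of the comparison, a common classic choice is the F1 score over base pairs. The sketch below illustrates that idea only: it assumes plain `(`, `)` and `.` symbols without pseudoknots, the function names `base_pairs` and `f1_score` are hypothetical, and it may differ from what the results notebooks actually compute. MAS is not reproduced here because its definition is given in the paper and notebooks, not in this README.

```python
from typing import Set, Tuple


def base_pairs(dot_bracket: str) -> Set[Tuple[int, int]]:
    """Return the (i, j) index pairs encoded by a dot-bracket string.

    Assumption: only '(', ')' and '.' are used (no pseudoknot brackets).
    """
    stack = []
    pairs = set()
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def f1_score(reference: str, predicted: str) -> float:
    """F1 over base pairs: harmonic mean of precision and recall."""
    if len(reference) != len(predicted):
        raise ValueError("structures must have the same length")
    ref, pred = base_pairs(reference), base_pairs(predicted)
    true_positives = len(ref & pred)
    if true_positives == 0:
        # Convention chosen for this sketch: no shared pairs gives 0.0.
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# The prediction recovers one of the two reference pairs.
score = f1_score("((..))", "(....)")
```

Here the prediction has precision 1 and recall 0.5, so the score is 2/3. A metric of this kind only sees paired versus unpaired positions, which is the limitation that a probing-score-based metric is meant to address.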

Datasets

Saccharomyces cerevisiae (sce) dataset

This dataset contains the structural profiles for 3,199 yeast RNA sequences.

From these sequences, the following three sce subsets are provided:

  • sce3k: all the sequences with more than 200 nt.
  • sce188: sequences obtained from sce3k by identifying non-coding transcripts using the Coding Potential Calculator 2 (CPC2).
  • sce18: sequences obtained by taking only those sequences from sce188 that were not previously classified as mRNA.
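The length criterion behind sce3k can be sketched as a simple filter. This is an illustration only: the dictionary layout, the function name `filter_long_sequences` and the toy sequences are assumptions, not the repository's actual loading code. The CPC2 and mRNA-annotation steps for sce188 and sce18 are not shown, since they depend on external tools and annotations.

```python
from typing import Dict

# Threshold stated for sce3k: sequences with more than 200 nt.
MIN_LENGTH_NT = 200


def filter_long_sequences(
    sequences: Dict[str, str], min_length: int = MIN_LENGTH_NT
) -> Dict[str, str]:
    """Keep only sequences strictly longer than min_length nucleotides."""
    return {name: seq for name, seq in sequences.items() if len(seq) > min_length}


# Toy input (hypothetical names and sequences, not taken from the dataset).
toy = {
    "short_rna": "ACGU" * 10,  # 40 nt
    "edge_rna": "A" * 200,     # exactly 200 nt, so it is excluded
    "long_rna": "ACGU" * 60,   # 240 nt
}
kept = filter_long_sequences(toy)
```

Only `long_rna` passes, because the comparison is strict ("more than 200 nt").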

Curated dataset of lncRNAs

In this dataset, we selected well-characterized lncRNAs from different species.

The following table shows more technical information about the lncRNAs included in the dataset.

| Name | Species | Length (nt) | Structure | Probing methodology | Reference |
|---|---|---|---|---|---|
| NORAD#1 | H. sapiens | 1,903 | NORAD1_37C | nextPARS | Chorostecki et al. 2021 |
| NORAD#2 | H. sapiens | 1,862 | NORAD2_37C | nextPARS | Chorostecki et al. 2021 |
| NORAD#3 | H. sapiens | 1,614 | NORAD3_37C | nextPARS | Chorostecki et al. 2021 |
| CYRANO | H. sapiens | 4,419 | CYRANO | SHAPE | Jones et al. 2020 |
| MEG3 | H. sapiens | 1,595 | MEG3 | SHAPE | Uroda et al. 2019 |
| RepA | M. musculus | 1,630 | RepA | SHAPE + chemical probing | Liu et al. 2017 |
| PAN | Human gammaherpesvirus 8 | 1,077 | PAN | SHAPE + MaP | Sztuba-Solinska et al. 2017 |
| XIST | M. musculus | 17,779 | XIST | SHAPE + MaP | Smola et al. 2016 |
| lincRNAp21 (sense) | H. sapiens | 311 | lincRNAp21_IRAlu_Sense | SHAPE | Chillón & Pyle 2016 |
| lincRNAp21 (antisense) | H. sapiens | 303 | lincRNAp21_IRAlu_Antisense | SHAPE | Chillón & Pyle 2016 |
| HOTAIR | H. sapiens | 2,154 | HOTAIR | SHAPE + chemical probing | Somarowthu et al. 2015 |
| MALAT1 | H. sapiens | 8,415 | MALAT1 | SHAPE | McCown et al. 2019 |
| ROX2 | D. melanogaster | 573 | ROX2 | SHAPE + PARS | Ilik et al. 2013 |

Methods for RNA structure prediction

To facilitate the reproducibility of our experiments, each predictive method evaluated in our study was implemented in a separate Python notebook that is fully functional and takes care of all installation requirements. With the exception of SPOT-RNA2, all methods can run in a free Google Colaboratory notebook.

The methods folder contains the implementation of each method in the comparison.

| Method | Reference | Year | Type | Repository / Web server | Notebook |
|---|---|---|---|---|---|
| CONTRAFold | Do et al. | 2006 | Statistical learning | Available / Available | Link |
| CentroidFold | Sato et al. | 2009 | Statistical decision theory | Available / Available | Link |
| ShapeKnots | Deigan et al. | 2010 | Dynamic programming | Available | Link |
| ProbKnot | Bellaousov & Mathews | 2010 | Assembling structures from base-pair probabilities | Available / Available | Link |
| RNAstructure | Reuter & Mathews | 2010 | Thermodynamics | Available / Available | Link |
| RNAfold | Lorenz et al. | 2011 | Dynamic programming | Available / Available | Link |
| IPknot | Sato et al. | 2011 | Integer programming | Available / Available | Link |
| contextFold | Zakov et al. | 2011 | Structured-prediction learning | Available / Available | Link |
| RNAshapes | Janssen & Giegerich | 2014 | Abstract shape analysis | Available / Available | Link |
| pKiss | Janssen & Giegerich | 2014 | Abstract shape analysis | Available / Available | Link |
| SPOT-RNA | Singh et al. | 2019 | ResNet + Bi-LSTM | Available / Available | Link |
| LinearFold | Huang et al. | 2019 | Dynamic programming + statistical learning | Available / Available | Link |
| LinearPartition | Zhang et al. | 2020 | Dynamic programming + base-pairing probabilities | Available / Available | Link |
| rna-state-inf | Willmott et al. | 2020 | Bi-LSTM | Available | Link |
| SPOT-RNA2 | Singh et al. | 2021 | Ensemble of deep learning models | Available / Available | |
| UFold | Fu et al. | 2021 | U-Net | Available / Available | Link |
| MXfold2 | Sato et al. | 2021 | Deep learning + thermodynamic parameters | Available / Available | Link |

Each method was used to predict the secondary structure of each sequence in the datasets. The resulting structures are available in the predictions folder.
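Before comparing predictions from many tools, it can help to check that each predicted structure is well formed for its sequence. The sketch below is one possible sanity check, not part of the repository: the function name `is_well_formed` is hypothetical, and it assumes structures are stored as dot-bracket strings in which pseudoknot-capable methods may use the extra bracket types `[]`, `{}` and `<>`. The actual file format in the predictions folder is not described here.

```python
# Closing symbol -> matching opening symbol (assumed extended dot-bracket alphabet).
PAIRED = {")": "(", "]": "[", "}": "{", ">": "<"}
OPENERS = set(PAIRED.values())


def is_well_formed(sequence: str, structure: str) -> bool:
    """Check length agreement and bracket balance, one counter per bracket type."""
    if len(sequence) != len(structure):
        return False
    open_counts = {opener: 0 for opener in OPENERS}
    for symbol in structure:
        if symbol == ".":
            continue  # unpaired position
        if symbol in OPENERS:
            open_counts[symbol] += 1
        elif symbol in PAIRED:
            opener = PAIRED[symbol]
            if open_counts[opener] == 0:
                return False  # closing bracket with nothing left to close
            open_counts[opener] -= 1
        else:
            return False  # symbol outside the assumed alphabet
    return all(count == 0 for count in open_counts.values())


# Toy examples (hypothetical sequences, not taken from the datasets).
nested_ok = is_well_formed("GGGAAACCC", "(((...)))")
pseudoknot_ok = is_well_formed("GCGCAAGCGC", "([..)]....")
unbalanced = is_well_formed("GGGAAACCC", "((....)))")
```

Each bracket type is balanced independently, so crossing pairs such as `([..)]` are accepted as pseudoknots rather than rejected.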

Finally, the metric definitions, prediction comparisons, and figure generation are described in the results notebooks.
