This repository contains the source code and data needed to reproduce the results of:
Secondary structure prediction of long non-coding RNA: review and experimental comparison of existing approaches, L.A. Bugnon, A.A. Edera, S. Prochetto, M. Gerard, J. Raad, E. Fenoy, M. Rubiolo, U. Chorostecki, T. Gabaldón, F. Ariel, L. Di Persia, D.H. Milone & G. Stegmayer, Briefings in Bioinformatics, 2022.
In contrast to messenger RNAs, the function of long non-coding RNAs (lncRNAs) largely depends on their structure, which determines interactions with other molecules. During the last 20 years, classical approaches for predicting RNA secondary structure have been based on dynamic programming and thermodynamic calculations. In the last 4 years, a growing number of machine learning-based models, including deep learning, have achieved breakthrough performance in structure prediction of biomolecules such as proteins and have outperformed classical methods in small RNA folding.
Nevertheless, accurate prediction of intricate lncRNA structures remains challenging. The aim of this repository is to serve as a public benchmark, with a unified and consistent experimental setup based on curated structures and probing data of lncRNAs. The repository includes:
- 19 classical and recent methods for RNA structure prediction.
- 2 curated datasets of RNA sequences whose reference structures have been experimentally validated by chemical probing methods.
- 2 metrics for comparing predictions with reference structures represented as biochemical probing scores and dot-bracket notation.
The Mean Average Similarity (MAS) is a novel metric proposed in this study. Unlike classic metrics, MAS assesses predictive performance using probing scores, which yields a more sensitive measurement and allows comparative analyses that are not biased by the method used to obtain the reference structures. The score was tested on several cases, which are provided as an interactive notebook.
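For contrast with MAS, a classic comparison works directly on dot-bracket strings. The sketch below computes an F1 score over base pairs; it is only an illustration of that family of metrics, not the code used in the results notebooks. The function names `dot_bracket_to_pairs` and `pair_f1` are hypothetical, and only round brackets are interpreted, so pseudoknot symbols such as `[` and `]` are treated as unpaired.

```python
def dot_bracket_to_pairs(structure):
    """Return the set of (i, j) base pairs encoded by '(' and ')' in a dot-bracket string."""
    stack, pairs = [], set()
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {i}")
            pairs.add((stack.pop(), i))
        # Any other symbol (including pseudoknot brackets) counts as unpaired here.
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def pair_f1(reference, prediction):
    """F1 score over the base pairs of two dot-bracket strings of equal length."""
    if len(reference) != len(prediction):
        raise ValueError("structures must have the same length")
    ref_pairs = dot_bracket_to_pairs(reference)
    pred_pairs = dot_bracket_to_pairs(prediction)
    true_positives = len(ref_pairs & pred_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(ref_pairs)
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction recovers one of the two reference pairs.
score = pair_f1("((..))", "(....)")
```

A pair-based score of this kind only sees whether a pair is present or absent, which is the limitation that motivates a metric built on probing scores.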
This dataset contains the structural profiles for 3,199 yeast RNA sequences:
- sce_genes_folded.tab: Sequences and dot-bracket reference structures.
- sce_PARS_score.tab: Biochemical probing data obtained from PARS.
From these sequences, the following three sce subsets are provided:
- sce3k: all the sequences longer than 200 nt.
- sce188: sequences obtained from sce3k by identifying non-coding transcripts with the Coding Potential Calculator 2 (CPC2).
- sce18: sequences from sce188 that were not previously classified as mRNA.
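The length criterion behind sce3k can be sketched as a simple filter. This is a minimal illustration and not the repository's own preprocessing: the column layout (name, sequence, structure, separated by tabs) is an assumption about `sce_genes_folded.tab`, and `read_long_sequences` is a hypothetical helper. The CPC2 and mRNA-annotation steps are external and are not reproduced here.

```python
import csv
import io


def read_long_sequences(handle, min_length=200):
    """Yield (name, sequence, structure) for rows whose sequence exceeds min_length nt.

    Assumes tab-separated rows with name, sequence and structure in the
    first three columns (a hypothetical layout, to be checked against the file).
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        name, sequence, structure = row[0], row[1], row[2]
        if len(sequence) > min_length:
            yield name, sequence, structure


# Toy input: one short record and one record of 201 nt.
toy = io.StringIO(
    "short\tACGU\t....\n"
    "long\t" + "A" * 201 + "\t" + "." * 201 + "\n"
)
kept = [name for name, _, _ in read_long_sequences(toy)]
```

With a real file, the handle would come from `open("sce_genes_folded.tab", newline="")` instead of the in-memory toy input.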
In this dataset, we selected well-characterized lncRNAs from different species.
- lncRNAs.fasta: lncRNA sequences and experimentally validated dot-bracket structures.
- lncRNAs_probing_scores.csv: Probing scores for each lncRNA sequence using different enzymatic methods.
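A file that stores sequences together with dot-bracket structures can be read with a small parser such as the one below. It is a sketch under an assumed layout: each record starts with a `>` header, followed by sequence lines and structure lines, where a line counts as structure if it contains only dot-bracket symbols. The actual layout of `lncRNAs.fasta` should be checked before relying on it, and `parse_fasta_with_structure` is a hypothetical name.

```python
STRUCTURE_SYMBOLS = set(".()[]{}<>")


def parse_fasta_with_structure(text):
    """Map each header to a dict with its 'sequence' and 'structure' strings."""
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = {"sequence": "", "structure": ""}
        elif current is None:
            raise ValueError("content found before the first '>' header")
        elif set(line) <= STRUCTURE_SYMBOLS:
            # Only dot-bracket symbols: treat the line as structure.
            records[current]["structure"] += line
        else:
            records[current]["sequence"] += line.upper()
    return records


# Toy record with a made-up name and a 9-nt hairpin.
example = parse_fasta_with_structure(">demo\nGGGAAACCC\n(((...)))\n")
```

Checking that the sequence and structure of every record have the same length is a cheap sanity test after parsing.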
The following table shows more technical information about the lncRNAs included in the dataset.
To facilitate the reproducibility of our experiments, each predictive method evaluated in our study was implemented in a separate Python notebook, which is fully functional and installs all of its own requirements. With the exception of SPOT-RNA2, all methods can run in a free Google Colaboratory (Colab) notebook.
The methods folder contains the implementation of each method in the comparison.
Method | Reference | Year | Type | Repository | Web server | Notebook |
---|---|---|---|---|---|---|
CONTRAFold | | 2006 | Statistical learning | | | |
CentroidFold | | 2009 | Statistical decision theory | | | |
ShapeKnots | | 2010 | Dynamic programming | | | |
ProbKnot | | 2010 | Assembling structures from base-pair probabilities | | | |
RNAstructure | | 2010 | Thermodynamics | | | |
RNAfold | | 2011 | Dynamic programming | | | |
IPknot | | 2011 | Integer programming | | | |
contextFold | | 2011 | Structured-prediction learning | | | |
RNAshapes | | 2014 | Abstract shape analysis | | | |
pKiss | | 2014 | Abstract shape analysis | | | |
SPOT-RNA | | 2019 | ResNet + biLSTM | | | |
LinearFold | | 2019 | Dynamic programming + statistical learning | | | |
LinearPartition | | 2020 | Dynamic programming + base pairing probabilities | | | |
rna-state-inf | | 2020 | Bi-LSTM | | | |
SPOT-RNA2 | | 2021 | Ensemble of deep learning models | | | |
UFold | | 2021 | U-net | | | |
MXfold2 | | 2021 | Deep learning + thermodynamic parameters | | | |
Each method was used to predict the secondary structure of each sequence in the datasets. The resulting structures are available in the predictions folder.
Finally, the metric definitions, prediction comparisons and figure generation are described in the results notebooks:
- results_sce: for the Saccharomyces cerevisiae dataset.
- results_lncRNA: for the curated lncRNA dataset.