lncRNA-folding

This repository contains the source code and data for reproducibility of:

Secondary structure prediction of long non-coding RNA: review and experimental comparison of existing approaches, L.A. Bugnon, A.A. Edera, S. Prochetto, M. Gerard, J. Raad, E. Fenoy, M. Rubiolo, U. Chorostecki, T. Gabaldón, F. Ariel, L. Di Persia, D.H. Milone & G. Stegmayer, Briefings in Bioinformatics, 2022.

In contrast to messenger RNAs, the function of long non-coding RNAs (lncRNAs) largely depends on their structure, which determines interactions with other molecules. During the last 20 years, classical approaches for predicting RNA secondary structure have been based on dynamic programming and thermodynamic calculations. In the last 4 years, a growing number of machine learning-based models, including deep learning, have achieved breakthrough performance in structure prediction of biomolecules such as proteins and have outperformed classical methods in folding small RNAs.

Nevertheless, accurate prediction of the intricate structures of lncRNAs is still challenging. The aim of this repository is to serve as a public benchmark, with a unified and consistent experimental setup based on curated structures and probing data of lncRNAs. The repository includes:

19 classical and recent methods for RNA structure prediction.
2 curated datasets of RNA sequences whose reference structures have been experimentally validated by chemical probing methods.
2 metrics for comparing predictions with reference structures represented as biochemical probing scores and dot-bracket notation.

The Mean Average Similarity (MAS) is a novel metric proposed in this study. Unlike classic metrics, MAS assesses predictive performance using probing scores, which gives a more sensitive measurement and allows for comparative analyses that are not biased by the method used to obtain the reference structures. This score was tested in several cases, which are provided as an interactive notebook.
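For the dot-bracket side of the comparison, the sketch below shows one common way to score a predicted structure against a reference: extract the base pairs from both strings and compute the F1 score over them. This is an illustration only. It is not the MAS metric, and the function names dot_bracket_to_pairs and pair_f1 are hypothetical; the exact metric definitions used in the study are the ones in the results notebooks.

```python
def dot_bracket_to_pairs(structure):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    # One stack per bracket family, so extended notation with
    # pseudoknots (e.g. "[" and "]") can be read as well.
    closers = {")": "(", "]": "[", "}": "{", ">": "<"}
    stacks = {opener: [] for opener in closers.values()}
    pairs = set()
    for j, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(j)
        elif symbol in closers:
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError(f"unbalanced '{symbol}' at position {j}")
            pairs.add((stack.pop(), j))
        # Any other symbol (typically ".") is an unpaired position.
    if any(stacks.values()):
        raise ValueError("unbalanced opening bracket")
    return pairs


def pair_f1(reference, prediction):
    """F1 score over base pairs (assumed metric, for illustration)."""
    ref = dot_bracket_to_pairs(reference)
    pred = dot_bracket_to_pairs(prediction)
    if not ref and not pred:
        return 1.0  # two fully unpaired structures agree
    true_positives = len(ref & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy structures, not taken from the datasets.
    print(pair_f1("((..))", "(....)"))
```

Scoring pairs rather than individual characters means a prediction is rewarded only when both partners of a base pair match the reference.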

Datasets

Saccharomyces cerevisiae (sce) dataset

This dataset contains the structural profiles for 3,199 yeast RNA sequences:

sce_genes_folded.tab: Sequences and dot-bracket reference structures.
sce_PARS_score.tab: Biochemical probing data obtained from PARS.

From these sequences, the following three sce subsets are provided:

sce3k: all the sequences with more than 200 nt.
sce188: sequences from sce3k identified as non-coding transcripts by the Coding Potential Calculator 2 (CPC2).
sce18: sequences from sce188 that were not previously classified as mRNA.
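The length criterion for sce3k can be sketched as a simple filter. This is a minimal illustration that assumes the sequences are already loaded into a dictionary mapping names to sequence strings; the function name longer_than and the toy entries are hypothetical, and the CPC2 and mRNA-annotation steps behind sce188 and sce18 are not reproduced here.

```python
MIN_LENGTH_NT = 200  # threshold stated for the sce3k subset


def longer_than(sequences, min_length=MIN_LENGTH_NT):
    """Keep only the sequences strictly longer than min_length nucleotides."""
    return {
        name: sequence
        for name, sequence in sequences.items()
        if len(sequence) > min_length
    }


if __name__ == "__main__":
    # Toy entries, not taken from the dataset.
    toy = {"short_rna": "ACGU" * 50, "long_rna": "ACGU" * 51}
    print(sorted(longer_than(toy)))
```

The comparison is strict, following the "more than 200 nt" wording, so a sequence of exactly 200 nt is excluded.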

Curated dataset of lncRNAs

In this dataset, we selected well-characterized lncRNAs from different species.

lncRNAs.fasta: lncRNA sequences and experimentally validated dot-bracket structures.
lncRNAs_probing_scores.csv: Probing scores for each lncRNA sequence using different enzymatic methods.
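A file that stores both a sequence and a dot-bracket structure per record could be read along the following lines. This sketch assumes a layout in which each record has a ">" header line followed by sequence lines and structure lines; the actual layout of lncRNAs.fasta should be checked before relying on it. The function name parse_fasta_with_structure and the toy record are hypothetical.

```python
STRUCTURE_SYMBOLS = set(".()[]{}<>")


def parse_fasta_with_structure(text):
    """Parse FASTA-like text into {name: {"sequence": ..., "structure": ...}}.

    Assumed layout: a ">" header, then sequence lines and dot-bracket lines.
    A line made only of dot-bracket symbols is treated as structure.
    """
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = {"sequence": "", "structure": ""}
        elif current is None:
            raise ValueError("data found before the first header line")
        elif set(line) <= STRUCTURE_SYMBOLS:
            records[current]["structure"] += line
        else:
            records[current]["sequence"] += line.upper()
    # A structure, when present, must annotate every nucleotide.
    for record_name, record in records.items():
        structure, sequence = record["structure"], record["sequence"]
        if structure and len(structure) != len(sequence):
            raise ValueError(f"length mismatch in record '{record_name}'")
    return records


if __name__ == "__main__":
    # Toy record, not taken from the dataset.
    toy_text = ">toy_rna\nGGGAAACCC\n(((...)))\n"
    print(parse_fasta_with_structure(toy_text))
```

Checking that the structure and the sequence have the same length catches truncated or misaligned records before any metric is computed.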

The following table shows more technical information about the lncRNAs included in the dataset.

Name	Species	Length	Probing methodology	Reference
NORAD#1	H. sapiens	1,903	nextPARS	Chorostecki et al. 2021
NORAD#2	H. sapiens	1,862	nextPARS	Chorostecki et al. 2021
NORAD#3	H. sapiens	1,614	nextPARS	Chorostecki et al. 2021
CYRANO	H. sapiens	4,419	SHAPE	Jones et al. 2020
MEG3	H. sapiens	1,595	SHAPE	Uroda et al. 2019
RepA	M. musculus	1,630	SHAPE + chemical probing	Liu et al. 2017
PAN	Human gammaherpesvirus 8	1,077	SHAPE + MaP	Sztuba-Solinska et al. 2017
XIST	M. musculus	17,779	SHAPE + MaP	Smola et al. 2016
lincRNAp21 (sense)	H. sapiens	311	SHAPE	Chillón & Pyle 2016
lincRNAp21 (antisense)	H. sapiens	303	SHAPE	Chillón & Pyle 2016
HOTAIR	H. sapiens	2,154	SHAPE + chemical probing	Somarowthu et al. 2015
MALAT1	H. sapiens	8,415	SHAPE	McCown et al. 2019
ROX2	D. melanogaster	573	SHAPE + PARS	Ilik et al. 2013

Methods for RNA structure prediction

To facilitate the reproducibility of our experiments, each predictive method evaluated in our study was implemented in a separate Python notebook, which is fully functional and handles all of its installation requirements. With the exception of SPOT-RNA2, all methods can run in a free Google Colaboratory notebook.

The methods folder contains the implementation of each method in the comparison.

Method	Reference	Year	Type	Repository	Web server	Notebook
CONTRAFold	Do et al.	2006	Statistical learning	Available	Available	Link
CentroidFold	Sato et al.	2009	Statistical decision theory	Available	Available	Link
ShapeKnots	Deigan et al.	2010	Dynamic programming	Available		Link
ProbKnot	Bellaousov & Mathews	2010	Assembling structures from base-pair probabilities	Available	Available	Link
RNAstructure	Reuter & Mathews	2010	Thermodynamics	Available	Available	Link
RNAfold	Lorenz et al.	2011	Dynamic programming	Available	Available	Link
IPknot	Sato et al.	2011	Integer programming	Available	Available	Link
contextFold	Zakov et al.	2011	Structured-prediction learning	Available	Available	Link
RNAshapes	Janssen & Giegerich	2014	Abstract shape analysis	Available	Available	Link
pKiss	Janssen & Giegerich	2014	Abstract shape analysis	Available	Available	Link
SPOT-RNA	Singh et al.	2019	ResNet + biLSTM	Available	Available	Link
LinearFold	Huang et al.	2019	Dynamic programming + statistical learning	Available	Available	Link
LinearPartition	Zhang et al.	2020	Dynamic programming + base pairing probabilities	Available	Available	Link
rna-state-inf	Willmott et al.	2020	Bi-LSTM	Available		Link
SPOT-RNA2	Singh et al.	2021	Ensemble of deep learning models	Available	Available
UFold	Fu et al.	2021	U-net	Available	Available	Link
MXfold2	Sato et al.	2021	Deep learning + thermodynamic parameters	Available	Available	Link

Each method was used to predict the secondary structure for each sequence in the datasets. Resulting structures are available in the predictions folder.

Finally, the metric definitions, prediction comparisons and figure generation are described in the results notebooks:

results_sce: for Saccharomyces cerevisiae dataset.
results_lncRNA: for curated lncRNA dataset.
