Feature Extraction Approaches for Biological Sequences: A Comparative Study of Mathematical Features
As a consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and number of features extracted. This effect has motivated new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy, and complex networks). As a case study, we analyze long non-coding RNA (lncRNA) sequences. We separated this work into three studies: (I) we assessed our proposal on the most addressed problem in our review, i.e., lncRNA vs. mRNA; (II) we validated the mathematical features in different classification problems to predict the class of lncRNA, e.g., circular RNA sequences; (III) we analyzed their robustness in scenarios with imbalanced data. The experimental results demonstrated three main contributions: (1) an in-depth study of several mathematical features; (2) a new feature extraction pipeline; and (3) its high performance and robustness for distinct RNA sequence classification.
- Robson Parmezan Bonidia, Lucas Dias Hiera Sampaio, Douglas Silva Domingues, Alexandre Rossi Paschoal, Fabrício Martins Lopes, André Carlos Ponce de Leon Ferreira de Carvalho, Danilo Sipoli Sanches
Correspondence: robservidor@gmail.com
Article: Robson P Bonidia, Lucas D H Sampaio, Douglas S Domingues, Alexandre R Paschoal, Fabrício M Lopes, André C P L F de Carvalho, Danilo S Sanches, Feature extraction approaches for biological sequences: a comparative study of mathematical features, Briefings in Bioinformatics, 2021;, bbab011, https://doi.org/10.1093/bib/bbab011.
@article{10.1093/bib/bbab011,
author = {Bonidia, Robson P and Sampaio, Lucas D H and Domingues, Douglas S and Paschoal, Alexandre R and Lopes, Fabrício M and de Carvalho, André C P L F and Sanches, Danilo S},
title = "{Feature extraction approaches for biological sequences: a comparative study of mathematical features}",
journal = {Briefings in Bioinformatics},
year = {2021},
month = {02},
issn = {1477-4054},
doi = {10.1093/bib/bbab011},
url = {https://doi.org/10.1093/bib/bbab011},
note = {bbab011},
eprint = {https://academic.oup.com/bib/advance-article-pdf/doi/10.1093/bib/bbab011/36254240/bbab011.pdf},
}
- examples: Example files;
- methods: Main Files - Feature Extraction Models, e.g., Fourier, Numerical Mapping, Entropy, Complex Networks;
- preprocessing: Preprocessing Files;
- README: Documentation;
- requirements: Dependencies.
- Python (>=3.7.3)
- Biopython
- Igraph
- NumPy
- Pandas
- SciPy
These instructions assume that Python is already installed. If it is not, download it from: https://www.python.org/downloads/release/python-375/.
$ git clone https://github.com/Bonidia/FeatureExtraction_BiologicalSequences FeatureExtraction
$ cd FeatureExtraction
$ pip3 install -r requirements.txt
$ apt-get -y install python3-igraph
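After installation, a quick check that the dependencies listed above can be imported may save time before running the methods. The snippet below is only a sketch: the import names `Bio` (Biopython) and `igraph` (python-igraph) are assumptions about how those packages are exposed, and `check_dependencies` is a hypothetical helper that is not part of this repository.

```python
import importlib.util

# Import names are assumptions: "Bio" for Biopython, "igraph" for python-igraph.
REQUIRED_MODULES = ["Bio", "igraph", "numpy", "pandas", "scipy"]


def check_dependencies(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Any module reported as MISSING can usually be installed again with pip3 or the system package manager.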
In this section, 10 feature extraction methods are available: 7 numerical mapping techniques with Fourier transform, 2 techniques with Entropy, and 1 with Complex Networks.
Before executing any method in this package, it is necessary to run a pre-processing script to eliminate noise from the sequences (e.g., letters other than the standard nucleotides, such as N or K). To use this script, follow the example below:
Important: This package only accepts sequence files in Fasta format as input to the methods.
Access folder: $ cd FeatureExtraction
To run the tool (Example): $ python3.7 preprocessing.py -i input -o output
Where:
-i = Input - Fasta format file, e.g., test.fasta
-o = Output - Fasta format file, e.g., output.fasta
Running:
$ python3.7 preprocessing.py -i dataset.fasta -o preprocessing.fasta
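To illustrate the kind of cleaning this step performs, here is a minimal sketch in plain Python. It assumes that a record containing any letter outside A, C, G, T is dropped entirely; whether preprocessing.py drops such records or handles them differently is not stated here, so treat this as an illustration only. The function names `read_fasta` and `keep_clean_records` are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def keep_clean_records(records, alphabet="ACGT"):
    """Keep only records whose sequence uses letters from `alphabet` (assumed rule)."""
    allowed = set(alphabet)
    return [(h, s) for h, s in records if s and set(s) <= allowed]


example = [">seq1", "ACGT", "acgt", ">seq2", "ACNT", ">seq3", "TTAA"]
print(keep_clean_records(read_fasta(example)))
```

In this example seq2 is discarded because it contains N, while seq1 and seq3 are kept.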
To use the Fourier model (numerical mapping with Fourier transform), follow the example below:
To run the code (Example): $ python3.7 methods/FourierClass.py -i input -o output -l label -r representation
Where:
-i = Input - Fasta format file, e.g., test.fasta
-o = output - CSV format file, e.g., test.csv
-l = Label - Dataset Label, e.g., lncRNA, mRNA, sncRNA
-r = representation/mappings, e.g., 1 = Binary, 2 = Z-curve, 3 = Real, 4 = Integer, 5 = EIIP, 6 = Complex Number, 7 = Atomic Number.
Running:
$ python3.7 methods/FourierClass.py -i sequences.fasta -o sequences.csv -l mRNA -r 2
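The idea behind this method can be sketched in a few lines: map each nucleotide to a number, apply a discrete Fourier transform, and summarize the power spectrum. The sketch below uses the EIIP mapping (representation 5) with the commonly cited EIIP values, and a direct O(N^2) DFT from the standard library so that it runs without extra packages. The summary statistics (`mean`, `max`, `peak_index`) and the function names are assumptions chosen for illustration, not the features written by FourierClass.py.

```python
import cmath

# Commonly cited EIIP values for DNA nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_mapping(sequence):
    """Convert a nucleotide string into a list of EIIP values."""
    return [EIIP[base] for base in sequence.upper()]


def power_spectrum(signal):
    """Return |X[k]|^2 for k = 0..N-1 using a direct O(N^2) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        x_k = sum(
            value * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, value in enumerate(signal)
        )
        spectrum.append(abs(x_k) ** 2)
    return spectrum


def spectrum_features(sequence):
    """Summarize the one-sided spectrum (DC term skipped); needs length >= 2."""
    spectrum = power_spectrum(eiip_mapping(sequence))
    half = spectrum[1 : len(spectrum) // 2 + 1]
    peak_index = max(range(len(half)), key=half.__getitem__) + 1
    return {"mean": sum(half) / len(half), "max": max(half), "peak_index": peak_index}


# A perfectly period-3 sequence of length 12 peaks at frequency index 12 / 3 = 4.
print(spectrum_features("ATG" * 4)["peak_index"])
```

For real sequences, numpy.fft.fft would normally replace the direct DFT, since it computes the same transform far more efficiently.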
To use the Entropy model, follow the example below:
To run the tool (Example): $ python3.7 methods/EntropyClass.py -i input -o output -l label -k k-mer -e Entropy
Where:
-i = Input - Fasta format file, e.g., test.fasta
-o = output - CSV format file, e.g., test.csv
-l = Label - Dataset Label, e.g., lncRNA, mRNA, sncRNA
-k = Range of k-mer, e.g., 1 = 1-mers only (k = 1), 2 = 1-mers and 2-mers (k = 1, 2)
-e = Type of Entropy, e.g., Shannon or Tsallis
Running:
$ python3.7 methods/EntropyClass.py -i sequences.fasta -o sequences.csv -l mRNA -k 10 -e Shannon
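As a rough illustration of what an entropy feature vector looks like, the sketch below computes one entropy value per k, for k = 1 up to the chosen maximum, from overlapping k-mer frequencies. It uses the standard definitions of Shannon entropy (base 2) and Tsallis entropy; the Tsallis parameter `q = 2.0`, the overlapping k-mer window, and all function names are assumptions for this example and may differ from what EntropyClass.py does.

```python
import math
from collections import Counter


def kmer_probabilities(sequence, k):
    """Relative frequencies of overlapping k-mers in the sequence."""
    kmers = [sequence[i : i + k] for i in range(len(sequence) - k + 1)]
    total = len(kmers)
    return [count / total for count in Counter(kmers).values()]


def shannon_entropy(probs):
    """H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs)


def tsallis_entropy(probs, q=2.0):
    """S_q = (1 - sum(p^q)) / (q - 1); q = 2.0 is an assumed default."""
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def entropy_features(sequence, max_k, kind="Shannon"):
    """One entropy value for each k in 1..max_k."""
    measure = shannon_entropy if kind == "Shannon" else tsallis_entropy
    return [measure(kmer_probabilities(sequence, k)) for k in range(1, max_k + 1)]


print(entropy_features("ACGTACGTTTGA", 3, "Shannon"))
print(entropy_features("ACGTACGTTTGA", 3, "Tsallis"))
```

With this layout, a call with max_k = 10 would produce ten columns per sequence, one per k-mer size.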
To use the Complex Networks model, follow the example below:
To run the tool (Example): $ python3.7 methods/ComplexNetworksClass.py -i input -o output -l label -k kmer -t threshold
Where:
-i = Input - Fasta format file, e.g., test.fasta
-o = output - CSV format file, e.g., test.csv
-l = Label - Dataset Label, e.g., lncRNA, mRNA, sncRNA
-k = k size, e.g., 2, 3 (default = 3 (codon)), 4
-t = threshold size, e.g., 2, 3 (default = 10).
Running:
$ python3.7 methods/ComplexNetworksClass.py -i sequences.fasta -o sequences.csv -l mRNA -k 3 -t 10
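One way to picture the roles of -k and -t is the sketch below. It assumes that the sequence is turned into an undirected graph whose nodes are overlapping k-mers, that an edge weight counts how often two k-mers appear consecutively, and that a network measure is recomputed after keeping only edges with weight of at least 1, 2, ..., t. These assumptions, the choice of average degree as the only measure, and the function names are illustrative; the actual script relies on igraph and may compute different measures.

```python
from collections import Counter


def kmer_graph(sequence, k=3):
    """Weighted undirected edges between consecutive overlapping k-mers."""
    kmers = [sequence[i : i + k] for i in range(len(sequence) - k + 1)]
    edges = Counter()
    for a, b in zip(kmers, kmers[1:]):
        edges[tuple(sorted((a, b)))] += 1
    return edges


def average_degree(edges):
    """Mean number of edge endpoints per node (a self-loop counts twice)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sum(degree.values()) / len(degree) if degree else 0.0


def threshold_features(sequence, k=3, threshold=10):
    """Average degree after keeping edges with weight >= t, for t = 1..threshold."""
    edges = kmer_graph(sequence, k)
    features = []
    for t in range(1, threshold + 1):
        kept = {edge: w for edge, w in edges.items() if w >= t}
        features.append(average_degree(kept))
    return features


print(threshold_features("ACGACGACG", k=3, threshold=3))
```

Under these assumptions, a larger threshold yields more columns per sequence, because the measure is recorded once per threshold step.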
If you use this code in a scientific publication, we would appreciate citations to the following paper:
Article: Robson P Bonidia, Lucas D H Sampaio, Douglas S Domingues, Alexandre R Paschoal, Fabrício M Lopes, André C P L F de Carvalho, Danilo S Sanches, Feature extraction approaches for biological sequences: a comparative study of mathematical features, Briefings in Bioinformatics, 2021;, bbab011, https://doi.org/10.1093/bib/bbab011.