Feature Extraction Approaches for Biological Sequences: A Comparative Study of Mathematical Models

Bonidia/FeatureExtraction_BiologicalSequences

Feature Extraction Approaches for Biological Sequences: A Comparative Study of Mathematical Features

As a consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and number of features extracted. This effect has motivated new algorithms and pipeline proposals, mainly involving feature extraction, in which extracting significant discriminatory information from a set of biological sequences is challenging. Considering this, our work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy, and complex networks). As a case study, we analyze long non-coding RNA (lncRNA) sequences. We separated this work into three studies: (I) we assessed our proposal on the most addressed problem in our review, i.e., lncRNA versus mRNA; (II) we validated the mathematical features on a different classification problem, predicting the class of lncRNA, e.g., circular RNA sequences; (III) we analyzed their robustness in scenarios with imbalanced data. The experimental results demonstrated three main contributions: (1) an in-depth study of several mathematical features; (2) a new feature extraction pipeline; and (3) its high performance and robustness for distinct RNA sequence classification.

Authors

  • Robson Parmezan Bonidia, Lucas Dias Hiera Sampaio, Douglas Silva Domingues, Alexandre Rossi Paschoal, Fabrício Martins Lopes, André Carlos Ponce de Leon Ferreira de Carvalho, Danilo Sipoli Sanches

Correspondence: robservidor@gmail.com

Publication

Article: Robson P Bonidia, Lucas D H Sampaio, Douglas S Domingues, Alexandre R Paschoal, Fabrício M Lopes, André C P L F de Carvalho, Danilo S Sanches, Feature extraction approaches for biological sequences: a comparative study of mathematical features, Briefings in Bioinformatics, 2021, bbab011, https://doi.org/10.1093/bib/bbab011.

@article{10.1093/bib/bbab011,
    author = {Bonidia, Robson P and Sampaio, Lucas D H and Domingues, Douglas S and Paschoal, Alexandre R and Lopes, Fabrício M and de Carvalho, André C P L F and Sanches, Danilo S},
    title = "{Feature extraction approaches for biological sequences: a comparative study of mathematical features}",
    journal = {Briefings in Bioinformatics},
    year = {2021},
    month = {02},
    issn = {1477-4054},
    doi = {10.1093/bib/bbab011},
    url = {https://doi.org/10.1093/bib/bbab011},
    note = {bbab011},
    eprint = {https://academic.oup.com/bib/advance-article-pdf/doi/10.1093/bib/bbab011/36254240/bbab011.pdf},
}

List of files

  • examples: Example files;
  • methods: Main files, i.e., the feature extraction models (Fourier, Numerical Mapping, Entropy, Complex Networks);
  • preprocessing: Preprocessing Files;
  • README: Documentation;
  • requirements: Dependencies.

Dependencies

  • Python (>=3.7.3)
  • Biopython
  • Igraph
  • NumPy
  • Pandas
  • SciPy

Installing dependencies and package

These instructions assume that Python is already installed. If it is not, download it from: https://www.python.org/downloads/release/python-375/.

$ git clone https://github.com/Bonidia/FeatureExtraction_BiologicalSequences FeatureExtraction

$ cd FeatureExtraction

$ pip3 install -r requirements.txt

$ apt-get -y install python3-igraph
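After installing, it can be useful to confirm that the listed dependencies are importable. The following is a minimal sketch, not part of this repository: the function name `missing_dependencies` is hypothetical, and the import names (`Bio` for Biopython, `igraph`, `numpy`, `pandas`, `scipy`) are the usual module names of those packages.

```python
import importlib.util

# Display name -> import name for the dependencies listed in this README.
MODULES = {
    "Biopython": "Bio",
    "igraph": "igraph",
    "NumPy": "numpy",
    "pandas": "pandas",
    "SciPy": "scipy",
}


def missing_dependencies(modules=MODULES):
    """Return the display names of packages that cannot be found for import."""
    return [
        name
        for name, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

If anything is reported as missing, rerunning the installation commands above is the first thing to try.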

Usage and Examples

This package provides 10 feature extraction methods: 7 numerical mapping techniques with Fourier transform, 2 techniques with entropy, and 1 with complex networks.

Preprocessing

Before running any method in this package, run the preprocessing script to remove noise from the sequences (e.g., ambiguous letters such as N or K). To use this script, follow the example below:

Important: The methods in this package only accept sequence files in FASTA format as input.

Access folder: $ cd FeatureExtraction
 
To run the tool (Example): $ python3.7 preprocessing.py -i input -o output


Where:

-i = Input - Fasta format file, e.g., test.fasta

-o = output - Fasta format file, e.g., output.fasta

Running:

$ python3.7 preprocessing.py -i dataset.fasta -o preprocessing.fasta 
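To illustrate the kind of filtering such a step performs, here is a minimal sketch in plain Python. It rests on assumptions: it drops every sequence that contains a letter outside A, C, G, T, U, whereas the actual `preprocessing.py` may handle noise differently. The names `read_fasta` and `filter_clean` are hypothetical and not part of this package.

```python
# Assumption: only these letters count as "clean" nucleotides.
VALID = set("ACGTU")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def filter_clean(records):
    """Keep only non-empty sequences made entirely of valid letters."""
    for name, seq in records:
        if seq and set(seq) <= VALID:
            yield name, seq


example = """>seq1
ACGTACGT
>seq2
ACGNNKGT
>seq3
acgu
""".splitlines()

clean = list(filter_clean(read_fasta(example)))
for name, seq in clean:
    print(f">{name}\n{seq}")
```

In this toy input, `seq2` is discarded because it contains N and K, while `seq1` and `seq3` are kept.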

Numerical Mapping with Fourier Transform

To use this model, follow the example below:

To run the code (Example): $ python3.7 methods/FourierClass.py -i input -o output -l label -r representation


Where:

-i = Input - Fasta format file, e.g., test.fasta

-o = output - CSV format file, e.g., test.csv

-l = Label - Dataset Label, e.g., lncRNA, mRNA, sncRNA

-r = representation/mappings, e.g., 1 = Binary, 2 = Z-curve, 3 = Real, 4 = Integer, 5 = EIIP, 6 = Complex Number, 7 = Atomic Number.

Running:

$ python3.7 methods/FourierClass.py -i sequences.fasta -o sequences.csv -l mRNA -r 2
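The general idea behind this model can be sketched as follows: map each nucleotide to a number, apply the discrete Fourier transform, and summarize the resulting power spectrum. The example below uses the standard EIIP values (mapping 5) and a naive DFT in plain Python. The function names and the summary statistics chosen here are illustrative assumptions and do not necessarily match what `FourierClass.py` computes.

```python
import cmath
import statistics

# Electron-ion interaction pseudopotential (EIIP) values for nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_mapping(seq):
    """Map a nucleotide sequence to its EIIP numerical signal (U is read as T)."""
    return [EIIP[base] for base in seq.upper().replace("U", "T")]


def power_spectrum(signal):
    """Return |X[k]|^2 for a naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


signal = eiip_mapping("ATGGCGTACGCT")
spectrum = power_spectrum(signal)

# Illustrative summary features, skipping the DC component at k = 0.
features = {
    "mean": statistics.mean(spectrum[1:]),
    "max": max(spectrum[1:]),
    "peak_index": max(range(1, len(spectrum)), key=lambda k: spectrum[k]),
}
print(features)
```

For real datasets a fast Fourier transform (for example from NumPy or SciPy, both listed as dependencies) would be the practical choice; the naive loop is used here only to keep the sketch dependency-free.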

Shannon and Tsallis Entropy

To use this model, follow the example below:

 
To run the tool (Example): $ python3.7 methods/EntropyClass.py -i input -o output -l label -k k-mer -e entropy


Where:

-i = Input - Fasta format file, e.g., test.fasta

-o = output - CSV format file, e.g., test.csv

-l = Label - Dataset Label, e.g., lncRNA, mRNA, sncRNA

-k = Range of k-mer, given as its upper bound, e.g., 1 uses 1-mers only and 2 uses 1-mers and 2-mers (1, 2)

-e = Type of entropy, e.g., Shannon or Tsallis

Running:

$ python3.7 methods/EntropyClass.py -i sequences.fasta -o sequences.csv -l mRNA -k 10 -e Shannon
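As a rough illustration of entropy-based features, the sketch below computes one entropy value per k-mer size from 1 up to a chosen maximum, using the textbook definitions of Shannon entropy, H = -Σ p·log2(p), and Tsallis entropy, S_q = (1 - Σ p^q) / (q - 1). The function names and the value q = 2.0 are assumptions made for this example; they are not taken from `EntropyClass.py`.

```python
import math
from collections import Counter


def kmer_probabilities(seq, k):
    """Relative frequencies of the overlapping k-mers found in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def shannon_entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs)


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy; q = 2.0 is an illustrative choice (q must not be 1)."""
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def entropy_features(seq, max_k, kind="Shannon"):
    """One entropy value for each k-mer size from 1 to max_k."""
    entropy = shannon_entropy if kind == "Shannon" else tsallis_entropy
    return [entropy(kmer_probabilities(seq, k)) for k in range(1, max_k + 1)]


print(entropy_features("ACGTACGGTCA", 3, kind="Shannon"))
print(entropy_features("ACGTACGGTCA", 3, kind="Tsallis"))
```

Under this reading, a maximum k of 10 would yield ten entropy values per sequence, one for each k-mer size.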

Complex Networks

To use this model, follow the example below:

 
To run the tool (Example): $ python3.7 methods/ComplexNetworksClass.py -i input -o output -l label -k kmer -t threshold


Where:

-i = Input - Fasta format file, e.g., test.fasta

-o = output - CSV format file, e.g., test.csv

-l = Label - Dataset Label, e.g., lncRNA, mRNA, sncRNA

-k = k-mer size, e.g., 2, 3, or 4 (default = 3, i.e., codon)

-t = threshold size, e.g., 2, 3 (default = 10).

Running:

$ python3.7 methods/ComplexNetworksClass.py -i sequences.fasta -o sequences.csv -l mRNA -k 3 -t 10
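One plausible way to turn a sequence into a network is sketched below: every k-mer becomes a node, consecutive k-mers are linked by an undirected edge weighted by how often the pair occurs, and a threshold removes low-weight edges before simple topological measures are computed. This is an assumption-laden illustration in plain Python; the construction, the thresholding rule, and the function names are hypothetical and may differ from what `ComplexNetworksClass.py` does with igraph.

```python
from collections import Counter


def kmer_graph(seq, k=3):
    """Weighted undirected edges between consecutive overlapping k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    edges = Counter()
    for a, b in zip(kmers, kmers[1:]):
        if a != b:  # ignore self-loops
            edges[tuple(sorted((a, b)))] += 1
    return edges


def apply_threshold(edges, min_weight):
    """Keep only the edges whose weight is at least min_weight."""
    return {edge: w for edge, w in edges.items() if w >= min_weight}


def degree_features(edges):
    """Basic measures over the nodes that still have at least one edge."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    if not degree:
        return {"nodes": 0, "edges": 0, "avg_degree": 0.0, "max_degree": 0}
    return {
        "nodes": len(degree),
        "edges": len(edges),
        "avg_degree": sum(degree.values()) / len(degree),
        "max_degree": max(degree.values()),
    }


graph = kmer_graph("ACGACGACG", k=3)
for min_weight in (1, 2, 3):
    print(min_weight, degree_features(apply_threshold(graph, min_weight)))
```

Computing the same measures at several thresholds gives one group of features per threshold, which is one way to obtain a fixed-length feature vector from sequences of different lengths.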

Citation

If you use this code in a scientific publication, we would appreciate citations to the paper listed in the Publication section above.
