Project: Applying Digital Signal Processing: FFT, Spectrograms, Wavelets and Machine Learning/Deep Learning to genome classification.
In this project we develop effective methods for classifying genomes (DNA sequences) based on Digital Signal Processing, Machine Learning, Deep Learning. This is on-going research and results will be published on a regular basis.
As a starting point we analyzed the following paper:
"ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels" by Gurjit S. Randhawa , Kathleen A. Hill and Lila Kari. https://doi.org/10.1186/s12864-019-5571-y
Department of Computer Science, University of Western Ontario, London, ON, Canada
Their DNA sequence classification method: ML-DSP is very effective and matches or outperforms the acuuracy of best existing methods with DNA sequence alignment.
They collected a large number of DNA sequences, and built many datasets for validation purposes: Vertebrates, Fungi, Insects...
They thourughly report many classification results. That we will use as reference for our research.
Our objective is to develop methods that outperform the accuracy of ML-DSP with the most challenging datasets: Fungi, Protists, Insects.
All datasets are available from Dr. Gurjit S. Randhawa Github repository:
https://github.com/grandhawa/MLDSP/tree/master/DataBase/
Data was extracted from the National Library of Medicine, National Center for Biotechnology Information (NCBI) website. The database can be searched and mitochondrial genome of various live species: plants, fungi... can be downloaded.
Illustration with : https://www.ncbi.nlm.nih.gov/labs/gquery/all/?term=NC_001224.1
The authors propose (we cite) a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. They test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of > 97%.
Their original ML-DSP approach consists of: "The main idea behind ML-DSP is to combine supervised machine learning techniques with digital signal processing, for the purpose of DNA sequence classification. More precisely, for a given set
$S={S_1,S_2,…,S_n}$ of n DNA sequences, ML-DSP uses:
DNA numerical representations.
Discrete Fourier Transform (DFT) of DNA numerical representations and extraction of spectrum magintudes
$M_{i}$
Pearson Correlation Coefficient (PCC) to compute the distance matrix of all pairwise distances for each pair of magnitude spectra
$(M_i,M_j)$ , where 1≤i,j≤n
Supervised Machine Learning classifiers which take the pairwise distance matrix for a set of sequences.
Their results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy than method and is overall faster.
Methodology:
In the initial ML-FFT implementation we achieved 100% accuracy with the vertebrate dataset "birds-fish-mammals" by:
- selecting the first NFFT=1024 points of each DNA sequence,
- applying window and a very low frequency high-pass filter
- feeding one-sided spectrum (frequency response) to Machine Learning classification algorithms: Logistic Regression, SVM.
This simple method did not work with more challenging datasets like the Fungi dataset.
dataset
The Birds - Fishes - Mammals DNA sequences dataset is available here:
https://github.com/grandhawa/MLDSP/tree/master/DataBase/Birds-Fish-Mammals
Class | Genomes (count) |
---|---|
Birds | 553 |
Fish | 874 |
Mammals | 2313 |
With this dataset, the authors achieve 100% accuracy with the ML-DSP method ! Results with our classification method are presented below.
Results
We selected the first NFFT=256, 512, 1024, 2048 in each DNA sequence and then computed the FFT spectrum.
We tested the ML-FFT approach with and without the spectrum phase. Some results are reported in the table below. It looks like the phase add some value with very short sequences NFFT=256, 512.
Optimal results were achieved with NFFT=1024 and 2048. The phase was not instrumental.
Without any particular pre-processing we achieve an accuracy close to 100% with Logistic Regression and SVM.
Like the authors, to measure the performance of such a classifier, we optimized hyperparameters and used the 10-fold cross-validation technique.
Approach | Accuracy | ML Technique |
---|---|---|
First NFFT= 512 points | 97-98% | SVM rbf kernel |
First NFFT= 1024 Magnitude+Phase (1024 features) | ~100% | SVM linear kernel |
First NFFT= 1024 Magnitude only (512 features) | ~100% | SVM rbf kernel |
We display the best result below.
Classification report |
Confusion Matrix |
- The DNA sequence classification of vertebrates, with three different classes is not really a challenge.
- Classification within a same class i.e datasets: Fungi, Insects, is much more challenging.
- We tested other ML models: K Nearest Neighbor(KNN), Random Forest Classifier, Decision Tree with accuracy in the 96-98% range. Not as efficient as SVM. They are left as an exercise.
Jupyter Notebook:
Methodology
The methodology is described in the Jupyter Notebook Part IIa.
Fungi dataset
It is available here:
https://github.com/grandhawa/MLDSP/tree/master/DataBase/Fungi
Phylum | Genomes (count) |
DNA sequence (min-max length) |
---|---|---|
Basidiomycota | 30 | 9745 / 235849 |
Pezizomycotina | 104 | 1364 / 203051 |
Saccharomycotina | 90 | 18844 / 107123 |
We collected additional DNA sequences on the NCBI website and we created extra datasets for running the soft-align procedure and identifying reference DNA frames.
The additional datasets are saved in folders:
- A_Basidiomycota
- A_Pezizomycotina
- A_Saccharomycotina
A zip file of the amended fungi dataset folder: Fungi.zip
can be dowloaded in the current repository.
This folder is required for running the following Jupyter notebook: "Part II a: Matching Fungi DNA sub-sequences with cross-correlation".
Results
For the challenging Fungi dataset, the simple ML-FFT method does not work. We introduce a soft alignment method ("Soft Align") where:
- all frames length NFFT= 1024 points
- we select a NFFT reference frame in each Fungi phylum (sub-phylum). Three reference frames are identified.
- we select an optimal NFFT frame for each DNA sequence in each Fungi phylum. By comparison (cross-correlation) with the three reference frames.
- then we applied the simple ML-FFT method on optimal NFFT frames .
The ML-FFT + Soft Align outperforms ML-DSP with accuracy between 96 to 98% in a reasonable time.
ML-FFT + Soft Align Approach | Accuracy | ML Technique |
---|---|---|
NFFT= 1024 points Magnitude only (512 features) | 96% | Logistic Regression newton-cg solver SVM "rbf" solver |
NFFT= 1024 points Magnitude + Phase (1024 features) | 98% | Logistic Regression newton-cg solver |
Classification report |
Confusion Matrix |
Jupyter Notebooks:
Soft Align method with cross-correlation. Determination of reference frames for each class.
Full classification after importing reference frames for each class.
Excellent results were achieved using the method described in the previous section. This part is left as an exercise.
We used a subset of the Insect DNA sequence dataset for the soft-align process. Which may introduce a bias.
Collecting additional sequences on the website of the National Center for Biotechnology Information (NCBI) was tedious.
We may post the Jupyter notebook in the future.
Results
ML-FFT + Soft Alignement applied to insect dataset.
Classification report |
Confusion Matrix |
We did not test the ML-FFT + Soft Alignement method with the small protists dataset. More data is needed for the initial stage: DNA sequence alignement and definition of the reference DNA frames.