Parkinson's disease (PD) is one of the most painful, dangerous and incurable diseases affecting older people (mainly those over 50 years of age). It involves the death of dopamine-producing neurons in the brain. This neurodegeneration leads to a range of symptoms, such as coordination problems, slowness of movement, voice changes, stiffness and, eventually, progressive disability. So far there is no cure, although medication can offer significant relief of symptoms, especially in the early stages of the disease. It is therefore crucial to develop more sensitive diagnostic tools for detecting the disease. The main goal of this repository is to discriminate healthy people from those with PD.
Figure 1. Stages of PD.
In this repository, the dataset is obtained from the UCI Machine Learning Repository. It is composed of a range of biomedical voice measurements from 31 people, 23 of whom have PD. Each column in the dataset is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals.
Attribute | Description |
---|---|
name | ASCII subject name and recording number |
MDVP:Fo(Hz) | Average vocal fundamental frequency |
MDVP:Fhi(Hz) | Maximum vocal fundamental frequency |
MDVP:Flo(Hz) | Minimum vocal fundamental frequency |
MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP | Several measures of variation in fundamental frequency |
MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA | Several measures of variation in amplitude |
NHR, HNR | Two measures of ratio of noise to tonal components in the voice |
status | Health status of the subject: (one) Parkinson's, (zero) healthy |
RPDE, D2 | Two nonlinear dynamical complexity measures |
DFA | Signal fractal scaling exponent |
spread1, spread2, PPE | Three nonlinear measures of fundamental frequency variation |
Table 1. Attribute Information.
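Since the `name` attribute combines the subject name and the recording number, a subject identifier can be derived from it to group recordings by person. The sketch below is only an illustration: it assumes names of the form `phon_R01_S01_1` (subject token `S01`, recording number `1`), and the helper `subject_id` and the sample names are hypothetical, not part of this repository.

```python
import re
from collections import Counter


def subject_id(name):
    """Extract the subject token (e.g. 'S01') from a recording name.

    Assumes the pattern '<prefix>_S<digits>_<recording number>'.
    """
    match = re.search(r"_(S\d+)_\d+$", name)
    if match is None:
        raise ValueError(f"unexpected recording name: {name!r}")
    return match.group(1)


# Made-up sample of recording names following the assumed pattern.
names = ["phon_R01_S01_1", "phon_R01_S01_2", "phon_R01_S02_1"]

# Count how many recordings belong to each subject.
recordings_per_subject = Counter(subject_id(n) for n in names)
print(recordings_per_subject)
```

The resulting identifiers can then serve as the group labels when the data is split by person rather than by recording.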
Figure 2. PD and healthy voice instances.
Each person has 6 or 7 voice recordings. To evaluate each algorithm, the dataset was split at the level of individuals rather than at the level of voice recordings, so that recordings from the same person never appear in both the train set and the test set. The split was performed 10 times, with different people in the train set and the test set each time, using `train_size = 0.8`, which is equivalent to 25 people. In addition, the GridSearchCV procedure was applied to find the best hyperparameters of each algorithm, using the `LeaveOneGroupOut` method.
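The evaluation procedure described above can be sketched roughly as follows. This is not the repository's actual code: the data is synthetic, the helper `subject_split`, the three made-up features, the logistic regression model and the `C` grid are all assumptions chosen for illustration, and the inner search uses scikit-learn's default scoring.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 31 subjects, 23 with PD,
# 6 or 7 recordings each (195 rows in total), 3 made-up features.
n_subjects = 31
recordings = np.array([7] * 9 + [6] * 22)
subject_status = rng.permutation(np.array([1] * 23 + [0] * 8))
groups = np.repeat(np.arange(n_subjects), recordings)  # subject id per row
y = subject_status[groups]                             # label per row
X = rng.normal(size=(groups.size, 3)) + 0.8 * y[:, None]


def subject_split(groups, train_size, rng):
    """Return boolean train/test row masks that never split a subject."""
    subjects = rng.permutation(np.unique(groups))
    n_train = int(round(train_size * len(subjects)))  # 0.8 * 31 -> 25
    train_mask = np.isin(groups, subjects[:n_train])
    return train_mask, ~train_mask


scores = []
for repeat in range(10):  # 10 different subject-level splits
    train, test = subject_split(groups, 0.8, rng)
    # Hyperparameter search: each inner fold holds out one training subject.
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.1, 1.0, 10.0]},
        cv=LeaveOneGroupOut(),
    )
    search.fit(X[train], y[train], groups=groups[train])
    y_pred = search.predict(X[test])
    scores.append(f1_score(y[test], y_pred, zero_division=0))

print(f"mean F1 over {len(scores)} splits: {np.mean(scores):.3f}")
```

Passing `groups` to `fit` is what lets `LeaveOneGroupOut` keep all recordings of one person in the same fold during the search.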
Figure 3. Workflow of the developed module.
ALGORITHMS
- Logistic regression
- Decision Tree classifier
- Gaussian Naive Bayes
- Random Forest
- Support Vector Machine
- XGB classifier
METRICS
Because this is a medical problem, the goal is to reduce incorrect positive predictions. Neither precision nor recall alone serves this purpose, and neither does accuracy. Therefore, the F1-score is used, as it seeks a balance between precision and recall even with imbalanced classes.
Table 2. Calculated metrics, where TP, TN, FP and FN correspond to True Positives, True Negatives, False Positives and False Negatives, respectively.
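As a small worked example of how such metrics follow from the four counts, the sketch below uses the standard textbook definitions of accuracy, precision, recall and F1-score. The function name `classification_metrics` and the counts are made up for illustration and are not results from this project.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common metrics from confusion-matrix counts.

    Undefined ratios (zero denominator) are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for an imbalanced test set (42 positives, 8 negatives).
metrics = classification_metrics(tp=40, tn=5, fp=3, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the F1-score does not use TN at all, which is why it behaves differently from accuracy when one class dominates.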
Figure 4. Average of the metrics of each classifier.
This project is licensed under the MIT License.