This repository contains the code and data for predicting the influence of smoking on DNA methylation at different CpG islands. The code is written in Python and uses the sklearn library for machine learning. The data is stored in CSV format and is available in this repository.
- Question: How does smoking influence DNA methylation at different CpG islands?
- Data selection and preparation:
- GSM, smoking status, gender, age, and DNA methylation data
- Target variable:
- Smoking status
- Evaluation metrics:
- Accuracy
- Precision
- Recall
- F1 score
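As a minimal sketch, the four metrics above could be computed with sklearn as follows. The labels are made up for illustration, and `average="weighted"` is an assumption about how per-class scores are combined; the repository does not state which averaging was used.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels for illustration only: 1 = smoker, 0 = non-smoker.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]


def evaluate(y_true, y_pred, average="weighted"):
    """Return accuracy, precision, recall and F1 score as a dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=average, zero_division=0),
    }


metrics = evaluate(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With weighted averaging, recall is always equal to accuracy, so the two values coincide in the output.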
A dataset was used that includes information on smoking status, gender, age, and methylation values at specific CpG islands. The following steps were performed for data preparation:
- Removal of the 'GSM' column as it was not relevant for the analysis.
- Normalization of the values in the 'Gender' column to make them uniform.
- Imputation of missing values in the methylation columns.
- Encoding of categorical variables using LabelEncoder.
- Standardization of the methylation data using StandardScaler.
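The preparation steps above can be sketched roughly as follows. The toy DataFrame, the column names (`Smoking Status`, `Age`, the `cg` prefix for methylation columns), the gender spellings, and the mean imputation strategy are all assumptions made for illustration; the actual CSV in this repository may differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "GSM": ["GSM1", "GSM2", "GSM3", "GSM4"],
    "Smoking Status": ["current", "never", "never", "current"],
    "Gender": ["Male", " f", "female", "M"],
    "Age": [45, 52, 38, 61],
    "cg0001": [0.81, np.nan, 0.42, 0.77],
    "cg0002": [0.10, 0.15, np.nan, 0.30],
})

# 1. Drop the sample identifier, which carries no predictive information.
df = df.drop(columns=["GSM"])

# 2. Make the gender labels uniform (the spellings handled here are assumed).
gender_map = {"m": "male", "male": "male", "f": "female", "female": "female"}
df["Gender"] = df["Gender"].str.strip().str.lower().map(gender_map)

# 3. Impute missing methylation values (mean imputation is an assumption).
cpg_cols = [c for c in df.columns if c.startswith("cg")]
df[cpg_cols] = SimpleImputer(strategy="mean").fit_transform(df[cpg_cols])

# 4. Encode the categorical columns as integers.
for col in ["Gender", "Smoking Status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# 5. Standardize methylation values to zero mean and unit variance.
df[cpg_cols] = StandardScaler().fit_transform(df[cpg_cols])

print(df)
```

In practice, fitting the imputer and scaler on the training split only avoids leaking information from the test data into the model.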
To improve performance and reduce the dimensionality of the dataset, feature selection and dimensionality reduction were applied:
- VarianceThreshold: The features with variance zero were removed.
- SelectKBest: The top 10 features were selected.
- PCA: PCA was used to reduce the dimensionality to 10 principal components.
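A minimal sketch of these three steps is shown below. It assumes they are chained in a single sklearn `Pipeline` in the order listed and that `SelectKBest` uses the `f_classif` score; neither detail is confirmed by this README. The data is synthetic and only stands in for the methylation matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the methylation matrix (not the repository's data).
X, y = make_classification(n_samples=200, n_features=50, n_informative=8, random_state=0)
X = np.hstack([X, np.ones((X.shape[0], 1))])  # add one constant, zero-variance column

reducer = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),       # drop zero-variance features
    ("kbest", SelectKBest(score_func=f_classif, k=10)),    # keep the 10 best-scoring features
    ("pca", PCA(n_components=10)),                         # project onto 10 principal components
])

X_reduced = reducer.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```

Wrapping the steps in a `Pipeline` means the same fitted transformations can later be applied to the test data with `reducer.transform`.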
The data was split into 80% for training and 20% for testing, so that the model is evaluated on data it was not trained on.
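An 80/20 split like the one described could be done as in the sketch below. The synthetic data, the `stratify=y` option, and the `random_state` value are assumptions added for illustration and reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the repository's dataset).
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Hold out 20% of the samples for testing; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```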
The model hyperparameters were optimized using GridSearchCV.
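As a rough sketch of such a search, the example below tunes a Random Forest on synthetic data. The parameter grid, the 5-fold cross-validation, and the `f1_weighted` scoring are hypothetical choices, not the settings used in this repository.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the repository's dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical grid: every combination is evaluated with cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                    # 5-fold cross-validation on the training data
    scoring="f1_weighted",   # metric used to rank the combinations
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test score:", search.score(X_test, y_test))
```

After fitting, `search.best_estimator_` holds the model refitted on the full training set with the best parameters found.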
The following models were used for the prediction:
- Logistic Regression
- Random Forest
- SVM
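The three model types listed above can be trained and compared with a short loop, as in this sketch. It uses synthetic data and default hyperparameters (apart from `max_iter` and `random_state`), which are assumptions rather than the configuration used in this repository.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the repository's dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average="weighted")
    print(f"{name}: weighted F1 = {scores[name]:.2f}")
```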
The results obtained were the following:
- Logistic Regression (Base Model):
  - Accuracy: 0.74
  - Precision: 0.75
  - Recall: 0.74
  - F1 score: 0.66
- Random Forest (Advanced Model):
  - Accuracy: 0.74
  - Precision: 0.71
  - Recall: 0.74
  - F1 score: 0.70
- SVM (Advanced Model):
  - Accuracy: 0.73
  - Precision: 0.70
  - Recall: 0.73
  - F1 score: 0.66
All three models show similar results, with an accuracy of around 74%. This indicates that the models perform comparably at classifying smokers and non-smokers based on DNA methylation.
Although the models have similar accuracy, they differ in the other evaluation metrics. The Random Forest model has the same accuracy as Logistic Regression but a higher F1 score (0.70 versus 0.66), which suggests a better balance between precision and recall.
The SVM model has the lowest accuracy (0.73), and none of its other metrics exceeds those of the other two models, which indicates that SVM is not the best model for this dataset.
These results show that DNA methylation at specific CpG islands can be used to predict smoking status with reasonable accuracy using different models. The Random Forest model showed the best balance between the evaluation metrics, making it (slightly) the best model.
Optimization techniques and several models were used to obtain the best possible results. Even so, these models have a lot of room for improvement:
- Use more advanced models for prediction.
- Use more data for training.
- Use more features for prediction.