In the last decade, RNA sequencing technology and computational methodology have given a huge impetus to riboswitch research.
One of the main challenges in riboswitch classification is imbalanced data.
Previously published classifiers are all based on untreated imbalanced data, which leads them to neglect the minority classes and emphasize the majority class, and consequently to return skewed performance.
This repository includes machine learning model selection and performance evaluation (sensitivity, specificity, accuracy, and F-score).
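To illustrate why accuracy alone can mislead on imbalanced data, here is a minimal sketch of the four metrics named above, computed one-vs-rest from true and predicted labels. The helper names (`per_class_metrics`, `accuracy`) and the toy labels are hypothetical and are not taken from the notebooks in this repository.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return one-vs-rest sensitivity, specificity and F-score per class."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    total = len(y_true)
    metrics = {}
    for c in labels:
        tp = pairs[(c, c)]
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        fp = sum(n for (t, p), n in pairs.items() if t != c and p == c)
        tn = total - tp - fn - fp
        sens = tp / (tp + fn) if tp + fn else 0.0   # recall of class c
        spec = tn / (tn + fp) if tn + fp else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
        metrics[c] = {"sensitivity": sens, "specificity": spec, "f_score": f1}
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced example (hypothetical labels): 9 "major" sequences,
# 1 "minor" sequence, and a classifier that always predicts "major".
y_true = ["major"] * 9 + ["minor"]
y_pred = ["major"] * 10

acc = accuracy(y_true, y_pred)
scores = per_class_metrics(y_true, y_pred)
print(f"accuracy = {acc:.2f}")
for label, m in scores.items():
    print(label, m)
```

In this toy case the accuracy looks high, while the sensitivity and F-score of the minority class are both zero, which is the kind of skew that per-class metrics expose.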
- Read in the cleaned riboswitch k-mer matrix CSV file, which has the following format:

| class | kmer1 | kmer2 | ... | kmer N |
| --- | --- | --- | --- | --- |
| Family name 1 | k-mer count | ... | ... | ... |
| Family name 2 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| Family name M | k-mer count | ... | ... | ... |

- Generate fixed training and test sets and save them in the home directory.
- Apply 10-fold CV to six algorithms to select the best parameters for each. The script saves all best models, both balanced and imbalanced, in the `Model` folder.
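The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the notebook: the CSV content is synthetic, the k-mer column names and family names are made up, the split is stratified with a fixed seed (the source only says the split is fixed), a single decision tree stands in for the six algorithms, and the balancing step is not shown.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the cleaned k-mer matrix CSV: the first column
# "class" holds the family name, the remaining columns hold k-mer counts.
rng = np.random.RandomState(0)
rows = []
for family, n_seqs, shift in [("family_A", 60, 0), ("family_B", 20, 3)]:
    counts = rng.poisson(lam=[2 + shift, 5, 3 + shift, 1], size=(n_seqs, 4))
    for c in counts:
        rows.append([family, *c])
csv_text = pd.DataFrame(rows, columns=["class", "AA", "AC", "AG", "AU"]).to_csv(index=False)

# Step 1: read the matrix and separate labels from features.
data = pd.read_csv(io.StringIO(csv_text))
X = data.drop(columns="class")
y = data["class"]

# Step 2: fixed split. The seed makes it reproducible; stratify=y (an
# assumption here) keeps the family proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Step 3: 10-fold cross-validated grid search on the training set only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, None]},  # illustrative grid
    scoring="f1_macro",  # macro average weights every family equally
    cv=cv,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring with a macro-averaged F-score is one possible choice for imbalanced families, since plain accuracy would mostly reward the majority class.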
- Load the models generated by `model_selection.ipynb`.
- Load the training set and test set.
- Generate the classification report in the automatically created `classification report` folder.
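A rough sketch of the evaluation steps listed above, assuming the models are serialized with `pickle` (the actual format used by the notebooks is not stated here). The data are synthetic, logistic regression is a stand-in model, and the file names `example_model.pkl` and `example_model.txt` are hypothetical; everything is written to a temporary directory so the sketch leaves the project untouched.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the saved training and test sets.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Save a fitted stand-in model so there is something to load below.
workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "example_model.pkl")  # hypothetical name
with open(model_path, "wb") as fh:
    pickle.dump(LogisticRegression(max_iter=1000).fit(X_train, y_train), fh)

# Load the saved model and evaluate it on the held-out test set.
with open(model_path, "rb") as fh:
    model = pickle.load(fh)
report = classification_report(y_test, model.predict(X_test), digits=3)

# Create the report folder if it does not exist yet and write the report.
report_dir = os.path.join(workdir, "classification report")
os.makedirs(report_dir, exist_ok=True)
report_path = os.path.join(report_dir, "example_model.txt")
with open(report_path, "w") as fh:
    fh.write(report)
print(report)
```

The text report lists precision, recall (sensitivity), and F-score for each class, so a weak minority class stays visible even when the overall accuracy is high.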
- Load the models generated by `model_selection.ipynb`.
- Generate the confusion matrix and other figures. All are saved in the `Figures` folder.
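As an illustration of the figure step, the following sketch builds a confusion matrix with scikit-learn and saves it as a heatmap using plain matplotlib. The family names and predictions are made up, the output file name is hypothetical, and the figure is written to a `Figures` folder inside a temporary directory rather than the repository; the notebook's own plotting code may differ.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

labels = ["family_A", "family_B", "family_C"]  # hypothetical family names
y_true = ["family_A"] * 6 + ["family_B"] * 3 + ["family_C"] * 1
y_pred = ["family_A"] * 5 + ["family_B"] + ["family_B"] * 3 + ["family_A"]

# Rows are true families, columns are predicted families.
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Row-normalise so the diagonal shows per-family sensitivity.
cm_norm = cm / cm.sum(axis=1, keepdims=True)

fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(cm_norm, cmap="Blues", vmin=0, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted family")
ax.set_ylabel("True family")
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center")  # raw counts
fig.colorbar(im, ax=ax)
fig.tight_layout()

figures_dir = os.path.join(tempfile.mkdtemp(), "Figures")
os.makedirs(figures_dir, exist_ok=True)
figure_path = os.path.join(figures_dir, "confusion_matrix.png")
fig.savefig(figure_path, dpi=150)
plt.close(fig)
```

Colouring by the row-normalised values while annotating raw counts keeps small families readable next to large ones.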
ribo_colormap_produce_kmerfamily.ipynb
ribo_colormap_input_txt.ipynb
feature_selection
seaborn==0.9.0
pandas==0.24.2
PDPbox==0.2.0
shap==0.29.1
numpy==1.16.2
imbalanced_learn==0.4.3
matplotlib==3.0.3
ipython==7.5.0
imblearn==0.0
scikit_learn==0.21.2
To install the packages above:
- Change to the project's home directory, which contains the file `requirements.txt`.
- Run `pip install -r requirements.txt`.