- Installation
- Walkthrough of using Codon2Vec
Codon2Vec runs on the command line and is compatible with both Windows and Unix operating systems.
- Download and install Python 3 (version 3.7 or higher) from https://www.python.org/downloads/. Ensure that Python is added to your operating system's PATH.
- Download the Codon2Vec repository here: https://github.com/rhondene/Codon2Vec/tree/main/Codon2Vec
- Unzip the Codon2Vec folder, and open a terminal window in the uncompressed Codon2Vec folder.
- Next, do a one-time installation of the dependencies Codon2Vec needs to run. In the terminal window, type the following command:
python setup.py install
- Exit the Codon2Vec folder. Installation is now complete.
Codon2Vec takes a FASTA file of coding sequences and an expression table that is either comma-separated or tab-separated. Below are guidelines for how the input files should be formatted:
a. FASTA format:
b. Expression table:
Important guidelines
- Open a terminal in the working folder containing the input files and Codon2Vec package folder like so:
To see all the available options that modify model training, type:
python ./Codon2Vec/ --help
- To run Codon2Vec with default options, type this command on the terminal:
python ./Codon2Vec -CDS some_input.fasta -exprs some_exprs.csv -outfolder results
Recommendation: Model training is an iterative process, and the model may converge on a local optimum that is not necessarily the best one (How Neural Networks Learn). Train the model several times and choose the model with the best parameterization.
Setting Seed for Reproducibility: Neural networks are stochastic by design, so training the same model on the same data can yield different results. To improve the stability of results, use the -seed_num option on the command line.
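The two recommendations above can be combined by training once per seed and writing each run to its own output folder. The snippet below is a minimal sketch, not part of Codon2Vec: the helper `build_training_commands` is hypothetical, and it assumes that `-seed_num` takes an integer value (check `--help` to confirm). It only prints the commands; the line that would launch them is left commented out.

```python
import subprocess  # only needed if the run line below is uncommented


def build_training_commands(cds, exprs, seeds, out_prefix="results"):
    """Return one Codon2Vec training command (as an argument list) per seed.

    Hypothetical helper; assumes -seed_num accepts an integer value.
    """
    commands = []
    for seed in seeds:
        commands.append([
            "python", "./Codon2Vec",
            "-CDS", cds,
            "-exprs", exprs,
            "-outfolder", f"{out_prefix}_seed{seed}",  # separate folder per run
            "-seed_num", str(seed),
        ])
    return commands


commands = build_training_commands(
    "some_input.fasta", "some_exprs.csv", seeds=[1, 2, 3]
)
for cmd in commands:
    print(" ".join(cmd))
    # To actually launch training, uncomment the next line:
    # subprocess.run(cmd, check=True)
```

After the runs finish, compare the evaluation metrics in each output folder and keep the model that performs best on the hold-out test set.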
A successful run writes the evaluation metrics of the model's performance on the hold-out test set to standard output and to a text file.
Please see the Methods section of the original manuscript, which explains each evaluation metric.
Model Performance Figures: The program also outputs summary figures of model evaluation, such as a confusion matrix and a learning curve that compares model accuracy during training vs. validation. Learning curves and confusion matrices are widely used in machine learning to diagnose overfitting or underfitting (see this informative blog post).
You have trained your model and are pleased with the model's predictive performance. Now you would like to use the saved model to make predictions on new sequences. To do so, type the following command in your terminal:
python ./Codon2Vec/predict.py -model your_trained_model -fasta new_seqs.fasta -out name_of_output
Because the predict() function is slightly stochastic, I advise running the predictions multiple times (at least 10) and taking the mean or median of the prediction probabilities.
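As a rough illustration of the averaging step, the sketch below combines per-sequence probabilities from repeated runs. It assumes you have already loaded each run's output into a dictionary mapping sequence ID to predicted probability; the layout of the prediction output file is not described here, so the loading step is left out. The function `summarize_predictions` and the example values are hypothetical.

```python
from statistics import mean, median


def summarize_predictions(runs):
    """Combine per-sequence probabilities from repeated prediction runs.

    runs: list of dicts, each mapping sequence ID -> predicted probability.
    Returns a dict mapping sequence ID -> (mean, median) across runs.
    """
    summary = {}
    for seq_id in runs[0]:
        values = [run[seq_id] for run in runs]
        summary[seq_id] = (mean(values), median(values))
    return summary


# Made-up probabilities from three runs, for illustration only.
runs = [
    {"gene_A": 0.80, "gene_B": 0.30},
    {"gene_A": 0.90, "gene_B": 0.20},
    {"gene_A": 0.70, "gene_B": 0.40},
]

summary = summarize_predictions(runs)
for seq_id, (avg, med) in summary.items():
    print(f"{seq_id}\tmean={avg:.3f}\tmedian={med:.3f}")
```

The median is less sensitive to an occasional outlying run, while the mean uses every run equally; reporting both makes it easy to spot sequences whose predictions are unstable.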