User Guide:

  1. Installation
  2. Walkthrough of using Codon2Vec

Background:

Codon2Vec runs on the command line and is compatible with both Windows and Unix operating systems.

A. First time setup instructions

  1. Download and install Python 3 (version 3.7 or higher) from https://www.python.org/downloads/. Ensure that Python is added to your operating system's PATH.
  2. Download the Codon2Vec repository here: https://github.com/rhondene/Codon2Vec/tree/main/Codon2Vec
  3. Unzip the Codon2Vec folder, and open a terminal window in the uncompressed Codon2Vec folder.
  4. Here you will do a one-time installation of the dependencies Codon2Vec needs to run. In the terminal window, type the following command:
       python setup.py install
  5. Exit the Codon2Vec folder. Installation is now complete.
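
Before running the installation command, it can help to confirm that the interpreter on your path meets the version requirement. The snippet below is a minimal sketch; the helper name `python_is_supported` is illustrative and is not part of Codon2Vec.

```python
import sys

# Minimum version stated in the setup instructions.
MIN_VERSION = (3, 7)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    status = "OK" if python_is_supported() else "too old for Codon2Vec"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Save it to a file and run it with the same `python` command you will use for Codon2Vec, so that the check applies to the interpreter that is actually on your path.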

B. Walkthrough of using Codon2Vec

1. Example input files

Codon2Vec takes a fasta file of coding sequences and an expression table that is either comma-separated or tab-separated. Below are guidelines for how the input files should be formatted:

a. fasta format:

b. Expression table:

Important guidelines

  • Ensure that the sequence IDs in the fasta file are identical to the sequence IDs in the expression table. However, the fasta file and the expression table do not have to be in the same order or contain the same number of genes.
  • For the fasta file, the program expects that the sequence ID immediately follows the '>' character.
  • For the expression table, ensure that the first two columns contain the sequence ID and the expression value, in that order.
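
To check these guidelines before training, you can compare the IDs in the two files. The sketch below is not part of Codon2Vec. It assumes that the sequence ID is the text between '>' and the first whitespace, that the expression table has one header row, and it uses made-up gene names and values as toy input.

```python
import csv
import io


def fasta_ids(handle):
    """Collect sequence IDs: the text right after '>' up to the first whitespace."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def expression_ids(handle, delimiter=","):
    """Collect IDs from the first column (assumes a single header row)."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row (assumption)
    return {row[0].strip() for row in reader if row}


# Hypothetical toy inputs; replace with open("your_file") handles.
fasta_text = ">gene1\nATGGCCAAATAA\n>gene2\nATGTTTGGGTGA\n>gene3\nATGCCCTAA\n"
table_text = "gene_id,expression\ngene2,12.5\ngene1,310.0\ngene4,0.8\n"

shared = fasta_ids(io.StringIO(fasta_text)) & expression_ids(io.StringIO(table_text))
print(sorted(shared))  # ['gene1', 'gene2']
```

For a tab-separated table, pass `delimiter="\t"`. If the shared set is empty or much smaller than expected, the IDs in the two files probably differ in spelling or formatting.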
2. Running Codon2Vec on the command line

    1. Open a terminal in the working folder that contains the input files and the Codon2Vec package folder.

    To see all the available options that modify model training, type:

           python ./Codon2Vec/ --help

    2. To run Codon2Vec with default options, type this command in the terminal:

           python ./Codon2Vec -CDS some_input.fasta -exprs some_exprs.csv -outfolder results

    Recommendation: Machine learning is an iterative process, and the model may converge on a local optimum that is not necessarily the best optimum (see How Neural Networks Learn). So perform model training multiple times and choose the model with the best parameterization.

    Setting Seed for Reproducibility: Neural networks are stochastic algorithms by design, so training the same model on the same data can yield different results. To improve the stability of results, use the -seed_num command-line option.
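
The two recommendations can be combined by training once per seed and writing each run to its own output folder. The following is a sketch under stated assumptions: it assumes that `-seed_num` takes an integer value (check `--help` to confirm), and the function names and the `results_seed<N>` folder naming are illustrative choices, not part of Codon2Vec.

```python
import subprocess


def build_training_command(fasta, exprs, outfolder, seed):
    """Build the training command; assumes -seed_num accepts an integer."""
    return [
        "python", "./Codon2Vec",
        "-CDS", fasta,
        "-exprs", exprs,
        "-outfolder", outfolder,
        "-seed_num", str(seed),
    ]


def train_with_seeds(fasta, exprs, seeds, run=subprocess.run):
    """Train once per seed, writing each run to its own folder."""
    for seed in seeds:
        command = build_training_command(fasta, exprs, f"results_seed{seed}", seed)
        run(command, check=True)  # raises CalledProcessError if a run fails


# Example call (not executed here):
# train_with_seeds("some_input.fasta", "some_exprs.csv", seeds=[1, 2, 3])
```

You can then compare the evaluation metrics in each output folder and keep the run that performs best.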

    3. Output:

    When the program runs successfully, it writes the evaluation metrics of the model's performance on the hold-out test set to standard output and to a text file.

    Please see the Methods section of the original manuscript, which explains each evaluation metric.

    Model Performance Figures: The program also outputs summary figures of model evaluation, such as a confusion matrix and a learning curve that compares model accuracy during training and validation. Learning curves and confusion matrices are widely used in machine learning to diagnose overfitting or underfitting (see this informative blog post).

    4. Predictions on new sequences

    You have trained your model and are pleased with the model's predictive performance. Now you would like to use the saved model to make predictions on new sequences. To do so, type the following command in your terminal:

    python ./Codon2Vec/predict.py -model your_trained_model -fasta new_seqs.fasta -out name_of_output

    Because the predict() function is slightly stochastic, I advise that you run the predictions multiple times (at least 10 times) and take the mean or median of the prediction probabilities.
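
Once the repeated predictions are collected, the averaging step can be sketched as below. The output format of predict.py is not described here, so this example assumes that you have already loaded each run into a dictionary mapping sequence ID to predicted probability. The function name `summarize_runs` and the toy values are hypothetical.

```python
from statistics import mean, median


def summarize_runs(runs):
    """Combine repeated predictions.

    runs: list of dicts, one per prediction run, each mapping a
    sequence ID to its predicted probability (assumed format).
    """
    per_id = {}
    for run in runs:
        for seq_id, prob in run.items():
            per_id.setdefault(seq_id, []).append(prob)
    return {
        seq_id: {"mean": mean(probs), "median": median(probs)}
        for seq_id, probs in per_id.items()
    }


# Toy values for three runs; in practice use at least 10 runs.
runs = [
    {"geneA": 0.75, "geneB": 0.25},
    {"geneA": 0.875, "geneB": 0.5},
    {"geneA": 0.625, "geneB": 0.125},
]
summary = summarize_runs(runs)
for seq_id, stats in summary.items():
    print(seq_id, stats)
```

The median is less sensitive than the mean to an occasional outlying run, so it may be the safer choice when only a few runs are available.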