-
This repository hosts different Python3 command-line programs for calculating popular codon usage and amino acid usage frequency statistics from FASTA sequence files (.fasta).
-
To use these tools, install python3 and then download the tool's executable .pyz file (a Python zip application) from its folder in this repo.
-
Motivation: I worked with hundreds of genomes, so I wrote these scripts to batch-process multiple genomes/input files and output a CSV-formatted table that is easy to parse and amenable to statistical analyses such as PCA. I had found this task tedious because previously published tools output the conventional wide-form codon usage table, which needed extra processing.
-
These codon usage tools were validated against the original CodonW software (Peden, 1995).
[Interactive Jupyter Notebook version Coming Soon in Summer 2024!]
-
All tools require python3 version 3.8 or higher and pandas version 2.0 or higher.
-
It is recommended to install python3 via Anaconda: https://docs.anaconda.com/anaconda/install/index.html
-
See the test_data folder for example outputs of each tool on the same input FASTA file ('NB_CDS.fasta').
- Computes the relative synonymous codon usage (RSCU) of each of the 59 degenerate codons for each coding sequence (CDS), according to Sharp and Li, 1986 (PMCID: PMC340524).
Input: FASTA file of N coding sequences (CDS)
Output: comma-separated table (CSV) of the relative synonymous codon usage for each transcript, i.e. a matrix of N transcripts x 59 RSCU values
How to Use:
- Download the Compute_RSCU_gene.pyz binary from the Compute_RSCU_gene GitHub repo into the project folder that contains the input FASTA file.
- Open a terminal window (bash, Git Bash, PowerShell, etc.) in the same working folder.
- Type the following in the terminal, replacing the input and output arguments with your own:
python Compute_RSCU_gene.pyz -CDS example_cds.fasta -out rscu_results
- Run python Compute_RSCU_gene.pyz --help for the help menu.
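As a rough illustration of the RSCU definition (observed codon count divided by the mean count of the synonymous codons for the same amino acid), the following is a minimal sketch. It is not the code inside Compute_RSCU_gene.pyz: the function name `rscu`, the use of the standard genetic code, and returning NaN for amino acids absent from the sequence are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return a dict of RSCU values for the 59 degenerate codons of one CDS.

    Hypothetical helper: amino acids that never occur in the sequence
    get NaN for all of their codons (an assumption of this sketch).
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group codons into synonymous families by amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    result = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:  # skip stops, Met (ATG) and Trp (TGG)
            continue
        total = sum(counts[c] for c in family)
        for codon in family:
            # RSCU = observed count / (family total / family size)
            result[codon] = counts[codon] * len(family) / total if total else float("nan")
    return result
```

For example, `rscu("TTTTTTTTC")` contains two TTT and one TTC codon, so phenylalanine's two codons get values of 4/3 and 2/3.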
- Computes the relative synonymous codon usage (RSCU) and absolute counts of the 59 synonymous codons over the entire (aggregate) set of coding sequences ('transcriptome-wide'). Implemented according to Sharp and Li, 1986 (PMCID: PMC340524).
Input: single- or multi-FASTA file of coding sequences (CDS)
Output: a comma-separated table (.csv) file of the 59 RSCU values
How to Use:
- Download the Compute_RSCU_tw.pyz binary from the Compute_RSCU_tw repo into the working folder that contains the input FASTA file of CDS.
- Open a terminal window (bash, Git Bash, PowerShell, etc.) in the same working folder.
- To run the program, type the command below in the terminal shell, replacing the arguments with the actual names of the input and output files:
python Compute_RSCU_tw.pyz -CDS example.fasta -out results
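To run the command above on several genomes, one option is a small Python wrapper that calls the tool once per FASTA file. This is only a sketch: it uses just the -CDS and -out flags shown above, while the helper names `build_commands` and `run_all` and the `<name>_rscu` output naming scheme are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(fasta_paths, pyz="Compute_RSCU_tw.pyz"):
    """Build one command (as an argument list) per input FASTA file."""
    commands = []
    for fasta in map(Path, fasta_paths):
        out_name = f"{fasta.stem}_rscu"  # hypothetical output naming scheme
        commands.append([sys.executable, pyz, "-CDS", str(fasta), "-out", out_name])
    return commands


def run_all(folder="."):
    """Run the tool on every .fasta file in a folder, stopping on the first error."""
    for cmd in build_commands(sorted(Path(folder).glob("*.fasta"))):
        subprocess.run(cmd, check=True)
```

Calling `run_all(".")` from the working folder would then process each .fasta file in turn.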
Computes the length-normalized frequency of each of the 61 sense codons in a coding sequence (CDS) and returns a CSV table.
Relative frequency of codon_i = (count of codon_i in CDS_j) / (total number of codons in CDS_j)
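The formula above can be sketched in a few lines of Python. This is an illustration rather than the CodonCount.pyz implementation: the function name `relative_codon_frequency` is hypothetical, and the sketch assumes that stop codons are left out of the denominator, which the formula does not specify.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
# The 61 sense codons of the standard genetic code.
SENSE_CODONS = [
    "".join(c) for c in product("ACGT", repeat=3) if "".join(c) not in STOP_CODONS
]


def relative_codon_frequency(cds):
    """Return {codon: count / total sense codons} for a single CDS."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in SENSE_CODONS)
    total = sum(counts.values())  # assumption: stop codons are not counted
    return {c: (counts[c] / total if total else 0.0) for c in SENSE_CODONS}
```

For the sequence ATGAAAAAATAA (ATG, AAA, AAA, stop), this gives 1/3 for ATG and 2/3 for AAA.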
How to Use:
- Download the CodonCount.pyz file from the CodonCount GitHub repo into the working folder that contains the input FASTA file(s).
- Open a terminal window (bash, Git Bash, PowerShell, etc.) in the same working folder.
- To run the program, type the command below in the terminal shell, replacing the arguments with the actual names of the input and output files:
python CodonCount.pyz -CDS example.fasta -out example_output
- Run python CodonCount.pyz --help for the help menu.
Computes codon usage per 1,000 codons over the whole transcriptome.
- Download the CodonUsage_per_1000.pyz file from the CodonUsage_per_1000 GitHub repo into the working folder that contains the input FASTA file(s).
- Open a terminal window (bash, Git Bash, PowerShell, etc.) in the same working folder.
- To run the program, type the command below in the terminal shell, replacing the arguments with the actual names of the input and output files:
python CodonUsage_per_1000.pyz -CDS all_CDS.fasta -out results_cu
- Run python CodonUsage_per_1000.pyz --help for the help menu.
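Codon usage per 1,000 pools the codon counts from all sequences and rescales them so that they sum to 1,000. A minimal sketch follows, assuming that all 64 codons (stop codons included) count towards the total and that codons containing ambiguous bases are dropped; the tool itself may handle these cases differently, and the function name is hypothetical.

```python
from collections import Counter


def codon_usage_per_1000(sequences):
    """Return {codon: occurrences per 1,000 codons} pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        counts.update(seq[i:i + 3] for i in range(0, usable, 3))

    # Assumption: keep only codons made of unambiguous bases.
    valid = {c: n for c, n in counts.items() if set(c) <= set("ACGT")}
    total = sum(valid.values())
    return {c: 1000 * n / total for c, n in sorted(valid.items())}
```

With the two toy sequences ATGAAATAA and ATGTAA there are five codons in total, so ATG and TAA each come out at 400 per 1,000 and AAA at 200.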
- Converts a FASTA file to a two-column CSV table (Header | Sequence).
-
Computes the expected and observed amino acid usage according to the methods outlined in https://pubmed.ncbi.nlm.nih.gov/5767777/ and https://qubeshub.org/publications/979/serve/1/3067?el=1&download=1
-
To run, download the script into your project folder and type the following in the terminal:
python aa_usage.py -CDS YOUR_CDS.fasta -out OUTPUT_NAME
- Corrects the issue of newlines within the same sequence.
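The FASTA-to-CSV conversion, including the joining of sequences that are wrapped over several lines, can be sketched as follows. This is an illustrative version using only the standard library, not the repository's script; the function names `read_fasta` and `fasta_to_csv` are hypothetical.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining sequence lines split by newlines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_csv(fasta_handle, csv_handle):
    """Write a two-column table (Header, Sequence) with one row per record."""
    writer = csv.writer(csv_handle)
    writer.writerow(["Header", "Sequence"])
    writer.writerows(read_fasta(fasta_handle))


# Example with in-memory files; real use would pass open file handles.
example = io.StringIO(">seq1 demo\nATG\nAAA\n>seq2\nTTT\n")
table = io.StringIO()
fasta_to_csv(example, table)
```

When writing to a real file, open it with `newline=""` as the csv module documentation recommends.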
Codon usage bias is the unequal usage of synonymous codons within a gene or genome, i.e. the deviation of synonymous codon usage from a uniform distribution due to a combination of natural selection, neutral mutational bias and genetic drift.
- If a particular amino acid is in some way adaptive, then it should occur more frequently than expected by chance.
- This can easily be tested by calculating the expected frequencies of amino acids and comparing them to the observed frequencies. The codons and observed frequencies of particular amino acids are given in the table.
- The frequencies of the nucleotide bases (written here as RNA) in nature are 22.0% uracil, 30.3% adenine, 21.7% cytosine, and 26.1% guanine. The expected frequency of a particular codon can then be calculated by multiplying the frequencies of the three bases that make up the codon. The expected frequency of an amino acid can then be calculated by adding the frequencies of each codon that codes for that amino acid.
- As an example, the RNA codons for tyrosine are UAU and UAC, so the random expectation for its frequency is (0.220)(0.303)(0.220) + (0.220)(0.303)(0.217) ≈ 0.0291. Since 3 of the 64 codons are nonsense (stop) codons, the expected frequency of each amino acid is then multiplied by a correction factor of 1.057.
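The calculation described above can be sketched in Python. This is an illustration under stated assumptions rather than the aa_usage.py implementation: it uses the base frequencies and the 1.057 correction factor quoted above, assumes the standard genetic code, and the function name `expected_aa_frequencies` is hypothetical.

```python
from itertools import product

# Base frequencies and correction factor as quoted in the text above.
BASE_FREQ = {"U": 0.220, "A": 0.303, "C": 0.217, "G": 0.261}
CORRECTION = 1.057

# Standard genetic code in NCBI order (bases U, C, A, G); '*' marks stop codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_aa_frequencies(base_freq=BASE_FREQ, correction=CORRECTION):
    """Return {amino acid: expected frequency} under random base composition."""
    expected = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # stop codons do not code for an amino acid
            continue
        # Codon probability = product of the frequencies of its three bases.
        p = 1.0
        for base in codon:
            p *= base_freq[base]
        # Amino acid probability = sum over its synonymous codons.
        expected[aa] = expected.get(aa, 0.0) + p
    # Rescale to account for the probability mass lost to stop codons.
    return {aa: p * correction for aa, p in expected.items()}
```

Calling `expected_aa_frequencies(correction=1.0)["Y"]` reproduces the uncorrected tyrosine example (about 0.029), and the default call applies the 1.057 factor to all 20 amino acids.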