From 07e1a6924457e402e517c77c49ae2c8ef9fd7d44 Mon Sep 17 00:00:00 2001
From: Michael Hiller
Date: Thu, 30 Mar 2017 19:34:12 +0200
Subject: [PATCH 1/3] Initial commit

---
 LICENSE   | 21 +++++++++++++++++++++
 README.md |  1 +
 2 files changed, 22 insertions(+)
 create mode 100644 LICENSE
 create mode 100644 README.md

diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..73c33fc
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2017 hillerlab
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..b464986
--- /dev/null
+++ b/README.md
@@ -0,0 +1 @@
+# CESAR-2.0
\ No newline at end of file

From b5aa6d0bf2f35db653394b6a64c627770788254e Mon Sep 17 00:00:00 2001
From: Michael Hiller
Date: Thu, 30 Mar 2017 21:10:23 +0200
Subject: [PATCH 2/3] Update README.md

---
 README.md | 133 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 132 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b464986..b638019 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,132 @@
-# CESAR-2.0
\ No newline at end of file
+# CESAR 2.0
+
+CESAR 2.0 (Coding Exon Structure Aware Realigner 2.0) is a method to realign coding exons or genes to DNA sequences using a Hidden Markov Model [1].
+
+Compared to its predecessor [2], CESAR 2.0 is 77X times faster on average (132X times faster for large exons) and requires 30-times less memory. In addition, CESAR 2.0 improves the accuracy of the comparative gene annotation by two new features. First, CESAR 2.0 substantially improves the identification of splice sites that have shifted over a larger distance, which improves the accuracy of detecting the correct exon boundaries.
+Second, CESAR 2.0 provides a new gene mode that re-aligns entire genes at once. This mode is able to recognize complete intron deletions and will annotate larger joined exons that arose by intron deletion events.
+
+
+
+# Installation
+Just call
+
+`make`
+
+to build CESAR2.
+
+The code is commented in doxygen style.
+To compile a doxygen documentation of this program at `doc/doxygen/index.html`, call
+
+`make doc`
+
+# Running CESAR 2.0 directly
+## Minimal example
+
+Just call
+
+`./cesar example/example1.fa`
+
+This will output the re-aligned exon, using the default donor/acceptor profile obtained from human.
+
+
+## Format of the input file
+The input file has to be a Fasta file. It provides at least one reference and
+at least one query sequence. References and queries have to be separated by a
+line starting with '#'. References are the exons (together with their reading frame) that you want to align to the query sequence.
+
+Example alignment of human exon against a mouse query sequence.
+```
+>human
+acACGTACGTgt
+####
+>mouse
+ACGTACGTACGTACGTACGTACGTACGTACGT
+```
+
+Example alignment of multiple human exons against multiple mouse queries.
+```
+>human#0
+acACACGTgt
+>human#1
+acACGTGTgt
+>human#2
+acACGTACGTgt
+####
+>mouse-1
+ACGTACGTACGTACGTACGTACGTACGTACGT
+>mouse-2
+ACGTACGTACGTACGTCGTCGTCGTCGTAAAAACGTACGTACGTACGTACGT
+```
+
+
+## Parameters
+
+`-f/--firstexon`
+The default profile for start codons is assigned to the acceptor profile of
+the first given exon.
+
+`-l/--lastexon`
+The default profile for stop codons is assigned to the donor profile of the last given exon.
+
+`-m/--matrix `
+Set `` as the path to the substitution matrix.
+
+`-p/--profiles `
+Set acceptor and donor profiles to `` resp. ``.
+
+`-c/--clade ` (default: `human`)
+A shortcut to default sets of substitution matrix and profiles.
+For example, `-c human` is synonymous to:
+`-m extras/tables/human/eth_matrix.txt -p extras/tables/human/acc_profile.txt extras/tables/human/do_profile.txt`
+
+By default, CESAR2 uses profiles obtained from human.
+You can provide profiles for another species in a directory extra/tables/$species and tell CESAR 2.0 to use these profiles by
+`./cesar --clade $species test/mocks/example1.fa`
+
+If contains a slash `/` it will be interpreted as look-up directory for profiles.
+
+**Note:** With `-l` and/or `-f`, the profiles will change accordingly.
+
+
+
+## Special parameters
+
+`-v/--verbosity `
+Print extra information to stderr.
+
+n | Information
+------------- | -------------
+1 | Input Parameters
+2 | List matrices and sequences in memory
+3 | Fasta parser and alignment state machine
+4 | Emission table initialization and Viterbi path
+5 | HMM state creation, transitions and HMM normalization
+6 | Full Viterbi step
+7 | Initialization and access of emission tables
+
+
+`-i/--split_codon_emissions `
+Manually define the length of split codons for each reference at once.
+
+**Note:** `-i` is deprecated. Use lower case letters the Fasta file to annotate
+split codons and upper case letters for all other codons. Alternatively
+separate split codons from full codons with the pipe character `|`.
+
+
+`-s/--set .. `
+Customize parameters, e.g. transition probabilities.
+
+Use with caution!
+
+
+`-V/--version`
+Print the version and exit.
+
+
+# References
+CESAR 2.0 was implemented by Peter Schwede (MPI-CBG/MPI-PKS 2017).
+
+[1] Sharma V, Schwede P, and Hiller M. CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation. Submitted
+
+[2] Sharma V, Elghafari A, and Hiller M. [Coding Exon-Structure Aware Realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation](https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw210). Nucleic Acids Res., 44(11), e103, 2016
+

From 73b87b5dcc8963d3b8a889f65aa5885d9b7b5fc8 Mon Sep 17 00:00:00 2001
From: Michael Hiller
Date: Thu, 30 Mar 2017 21:27:21 +0200
Subject: [PATCH 3/3] Update README.md

---
 README.md | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index b638019..a1e1022 100644
--- a/README.md
+++ b/README.md
@@ -30,32 +30,34 @@ This will output the re-aligned exon, using the default donor/acceptor profile o
 
 
 ## Format of the input file
-The input file has to be a Fasta file. It provides at least one reference and
+The input file has to be a multi-fasta file. It provides at least one reference and
 at least one query sequence. References and queries have to be separated by a
 line starting with '#'. References are the exons (together with their reading frame) that you want to align to the query sequence.
 
 Example alignment of human exon against a mouse query sequence.
 ```
 >human
-acACGTACGTgt
+gCCTGGGAACTTCACCTACCACATCCCTGTCAGTAGTGGCACCCCACTGCACCTCAGCCTGACTCTGCAGATGaa
 ####
 >mouse
-ACGTACGTACGTACGTACGTACGTACGTACGT
+CCTTTAGGCTTGGCAACTTCACCTACCACATCCCTGTCAGCAGCAGCACACCACTGCACCTCAGCCTGACCCTGCAGATGAAGTGAG
 ```
 
-Example alignment of multiple human exons against multiple mouse queries.
+The reading frame has to be indicated by lower case letters at the beginning and end of the reference exon. Lower case letters are bases belonging to a codon that is split by the intron. In this example, the 'g' is the third codon base and the first full codon is CCT. The 'aa' at the end are the codon bases 2 and 3 of the split codon.
+
 ```
->human#0
-acACACGTgt
->human#1
-acACGTGTgt
->human#2
-acACGTACGTgt
+>human
+GTCACAATCATTGGTTACACCCTGGGGATTCCTGACGTCATCATGGGGATCACCTTCCTGGCTGCTGGGACCAGCGTGCCTGACTGCATGGCCAGCCTCATTGTGGCCAGACAAg
 ####
->mouse-1
-ACGTACGTACGTACGTACGTACGTACGTACGT
->mouse-2
-ACGTACGTACGTACGTCGTCGTCGTCGTAAAAACGTACGTACGTACGTACGT
+>mouse
+CTCCAAGGTTACCATCATCGGCTACACACTAGGGATCCCTGATGTCATCATGGGGATCACCTTCCTGGCTGCCGGAACCAGCGTGCCAGACTGCATGGCCAGCCTCATTGTAGCCAGACAAGGTGG
+>sheep
+TCCCAGGTCACGATCATCGGCTACACGCTGGGGATTCCTGACGTCATCATGGGGAGACAAGGTGGGGCCCACGTGGGGAGGGCTGGGAAGGGAAGCCAGGCCTCCCTACTTAGGGGGTAGGGGGAGCTTGCCTGG
+```
+
+To use the gene mode of CESAR 2.0, provide an input file that lists multiple consecutive or all exons of a gene.
+```
+Example
 ```
 
 
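
Note on the input format introduced in this series: the README describes references and queries separated by a line starting with '#', with lower case letters at the exon ends marking split-codon bases. A minimal Python sketch of a reader for that layout might look like the following. It is not part of CESAR 2.0; the names `parse_cesar_input` and `split_codon_lengths` are hypothetical, and the alternative pipe character `|` notation mentioned in the README is not handled.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_cesar_input(text: str) -> Tuple[List[Record], List[Record]]:
    """Split a CESAR-style multi-fasta text into (references, queries).

    Assumption: every record before the first line starting with '#'
    is a reference, every record after it is a query.
    """
    references: List[Record] = []
    queries: List[Record] = []
    current = references
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Separator line; headers such as '>human#0' start with '>'
            # and therefore never reach this branch.
            current = queries
        elif line.startswith(">"):
            current.append((line[1:], ""))
        elif current:
            header, seq = current[-1]
            current[-1] = (header, seq + line)
        else:
            raise ValueError("sequence data found before a '>' header")
    if not references or not queries:
        raise ValueError("need at least one reference and one query")
    return references, queries


def split_codon_lengths(exon: str) -> Tuple[int, int]:
    """Count lower case bases at the start and end of a reference exon."""
    core = exon.lstrip("acgtn")
    head = len(exon) - len(core)
    tail = len(core) - len(core.rstrip("acgtn"))
    return head, tail
```

For a reference written as `gCCTaa`, `split_codon_lengths` reports one split-codon base at the start and two at the end, matching the convention the README describes for the 'g' and 'aa' flanks.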