Updated prot-scriber version to 0.1.5
- now in 'version(...)' clap will also report 0.1.5
- set it also in Git tags
- and in Cargo.toml
asishallab committed Dec 19, 2023
1 parent aed0b97 commit 7ddaf3e
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/main.rs
@@ -22,7 +22,7 @@ mod seq_sim_table_reader;
/// the `prot-scriber` annotation process and writes the results into the respective output file.
fn main() {
let matches = Command::new("prot-scriber")
- .version("version 0.1.4")
+ .version("version 0.1.5")
.about("\nPLEASE USE '--help' FOR MORE DETAILS!\n\nprot-scriber assigns human readable descriptions (HRD) to query biological sequences or sets of them (a.k.a gene-families).\n")
.after_help("\n\nMANUAL\n======\n\n1. Summary\n----------\n'prot-scriber' uses reference descriptions ('stitle' in Blast terminology) from sequence similarity search results (Blast Hits) to assign short human readable descriptions (HRD) to query biological sequences or sets of them (a.k.a gene, or sequence, families). In this, prot-scriber consumes sequence similarity search (Blast, Diamond, or similar) results in tabular format. A customized lexical analysis is carried out on the descriptions ('stitle' in Blast terminology) of these Blast Hits and a resulting HRD is assigned to the query sequences or query families, respectively.\n\n2. prot-scriber input preparation\n---------------------------------\nThis section explains how to run your favorite sequence similarity search tool, so that it produces tabular results in the format prot-scriber needs them. You can run sequence similarity searches with Blast [McGinnis, S. & Madden, T. L. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 32, W20–W25 (2004).] or Diamond [Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat Meth 12, 59–60 (2015).]. Note that there are other tools to carry out sequence similarity searches which can be used to generate the input for prot-scriber. As long as you have a tabular text file with the three required columns holding the query identifier, the subject ('Hit') identifier, and the subject ('Hit') description ('stitle' in Blast terminology) prot-scriber will accept this as input.\nDepending on the type of your query sequences the search method and searched reference databases vary. For amino acid queries search protein reference databases, for nucleotide query sequences search nucleotide reference databases. 
If you have protein coding nucleotide query sequences you can choose to either search protein reference databases using translated nucleotide queries with 'blastx' or 'diamond blastx' or search reference nucleotide databases with 'blastn' or 'diamond blastn'. Note, that before carrying out any sequence similarity searches you need to format your reference databases. This is achieved by either the 'makeblastdb' (Blast) or 'makedb' (Diamond) commands, respectively. Please see the respective tool's (Blast or Diamond) manual for details on how to format your reference sequence database.\n\n2.1 A note on TAB characters\n----------------------------\nTAB is often used as a field separator, e.g. by default in Diamond sequence similarity search result tables, or to separate gene-family identifiers from their respective gene-lists. Consequently, prot-scriber has several arguments that could be a TAB, e.g. the --field-separator (-p) or the --seq-family-id-genes-separator (-i) (please see below for more details on these arguments). Unfortunately providing the TAB character as a command line argument can be tricky. It is even more tricky to write it into a manual like this, because it appears as a blank whitespace and cannot easily be distinguished from other whitespace characters. We thus write '<TAB>' whenever we mean the TAB character. To type it in the command line and provide it as an argument to prot-scriber you can (i) either use $'\\t' (e.g. -p $'\\t') or (ii) hit Ctrl+v and subsequently hit the TAB key on your keyboard (e.g. -p '\t').\n\n2.2 Which reference databases to search\n---------------------------------------\nFor amino acid (protein) or protein coding nucleotide query sequences we recommend searching UniProt's Swissprot and trEMBL. For nucleotide sequences UniRef100 and, or UniParc might be good choices. Note that you can search _any_ database you deem to hold valuable reference sequences. 
However, you might have to provide custom blacklist, filter, and capture-replace arguments for Blast or Diamond output tables stemming from searches in these non UniProt databases (see section '3. Technical manual' on the arguments --blacklist-regexs (-b), --filter-regexs (-l), and --capture-replace-pairs (-c) for further details). If you want to search any NCBI reference database, please see section 2.2.1 for more details.\n\n2.2.1 NCBI reference databases\n------------------------------\nThe National Center for Biotechnology Information (NCBI) has excellent reference databases to be searched by Blast or Diamond, too. Note that NCBI and UniProt update each other's databases very frequently. So, by searching UniProt only you should not lose information. Anyway, NCBI has e.g. the popular non redundant ('NR') database. However, NCBI has a different description ('stitle' in Blast terminology) format. To make sure prot-scriber parses sequence similarity search result (Blast or Diamond) tables (SSSTs) correctly, you should use a tailored --filter-regexs (-l) argument. A file containing such a list of regular expressions specifically tailored for parsing SSSTs produced by searching NCBI reference databases, e.g. NR, is provided with prot-scriber. You can download it, and edit it if necessary, here: https://raw.githubusercontent.com/usadellab/prot-scriber/master/misc/filter_stitle_regexs_NCBI_NR.txt\n\n2.2.2 UniRef reference databases\n------------------------------\nThe UniRef databases (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase and selected UniParc records to obtain complete coverage of sequence space at several resolutions (100%, 90% and 50% identity) while hiding redundant sequences. The UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry (i.e. cluster). 
UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90% or 50% sequence identity levels. To make sure prot-scriber parses sequence similarity search result (Blast or Diamond) tables (SSSTs) correctly, you should use a tailored --filter-regexs (-l) argument. A file containing such a list of regular expressions specifically tailored for parsing SSSTs produced by searching UniRef databases is provided with prot-scriber. You can download it, and edit it if necessary, here: https://raw.githubusercontent.com/usadellab/prot-scriber/master/misc/filter_stitle_regexs_UniRef.txt\n\n2.3 Example Blast or Diamond commands\n-------------------------------------\nNote that the following instructions on how to execute your sequence similarity searches with Blast or Diamond only include the information - in terms of selected output table columns - absolutely required by 'prot-scriber'. You are welcome, of course, to have more columns in your tabular output, e.g. 'bitscore' or 'evalue' etc. Note that you need to search each of your reference databases with a separate Blast or Diamond command, respectively.\n\n2.3.1 Blast\n-----------\nGenerate prot-scriber input with Blast as follows. The following example uses 'blastp', replace it, if your query sequence type makes that necessary with 'blastn' or 'blastx'.\n\nblastp -db <reference_database.fasta> -query <your_query_sequences.fasta> -num_threads <how-many-do-you-want-to-use> -out <queries_vs_reference_db_name_blastout.txt> -outfmt \"6 delim=<TAB> qacc sacc stitle\"\n\nIt is important to note, that in the above 'outfmt' argument the 'delim' set to '<TAB>' means you need to actually type in a TAB character. (We write '<TAB>' here, so you see something, not only whitespace.) Typically you can type it by hitting Ctrl+v and subsequently the TAB key in the terminal (see section 2.1).\n\n2.3.2 Diamond\n-------------\nGenerate prot-scriber input with Diamond as follows. 
The following example uses 'blastp', replace it, if your query sequence type makes that necessary with 'blastn' or 'blastx'.\n\ndiamond blastp -p <how-many-threads-do-you-want-to-use> --quiet -d <reference-database.dmnd> -q <your_query_sequences.fasta> -o <queries_vs_reference_db_name_diamondout.txt> -f 6 qseqid sseqid stitle\n\nNote that diamond by default uses the '<TAB>' character as a field-separator for its output tables.\n\n2.4 Gene Family preparation and analysis\n----------------------------------------\nAssume you have the proteomes of eight crucifer plant species and want to cluster the respective amino acid sequences into gene families. Note that the following example provides code to be executed in a BASH Shell (also available on Windows). We provide a very basic procedure to perform the clustering:\n\n(i) \"All versus all\" Blast or Diamond\n\nAssume all amino acid sequences of the eight example proteomes stored in a single file 'all_proteins.fasta'\nRun:\n\ndiamond makedb --in all_proteins.fasta -d all_proteins.fasta\n\ndiamond blastp --quiet -p <how-many-threads-do-you-want-to-use?> -d all_proteins.fasta.dmnd -q all_proteins.fasta -o all_proteins_vs_all.txt -f 6 qseqid sseqid pident\n\n(ii) Run markov clustering\n\nNote that 'mcl' is a command line tool implementing the original Markov Clustering algorithm [Stijn van Dongen, A cluster algorithm for graphs. Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000]. On most systems you can install the 'mcl' binary using the respective package manager, e.g. 
'sudo apt-get update && sudo apt-get install -y mcl' (Debian / Ubuntu).\n\nmcl all_proteins_vs_all.txt -o all_proteins_gene_clusters.txt --abc -I 2.0\n\n(iii) Add gene family names to mcl output and filter out singleton clusters\n\nNote that we use the GNU tools 'sed' and 'awk' to do some basic post-processing of the 'mcl' output.\n\nsed -e 's/\\t/,/g' all_proteins_gene_clusters.txt | awk -F \",\" 'BEGIN{i=1}{if (NF > 1){print \"Seq-Fam_\" i \"\\t\" $0; i=i+1}}' > all_proteins_gene_families.txt\n\nCongratulations! You now have clustered your eight plant crucifer proteomes into gene families (file 'all_proteins_gene_families.txt').\n\n(iv) Run prot-scriber\n\nWe assume that you ran either 'blastp' or 'diamond blastp' (see section 2.3 for details) to search your selected reference databases with the 'all_proteins.fasta' queries. Here, we assume you have searched UniProt's Swissprot and trEMBL databases.\n\nprot-scriber -f all_proteins_gene_families.txt -s all_proteins_vs_Swissprot_blastout.txt -s all_proteins_vs_trEMBL_blastout.txt -o all_proteins_gene_families_HRDs.txt")
.arg(
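Step (iii) of section 2.4 in the help text above post-processes the 'mcl' output with 'sed' and 'awk'. For readers who prefer to do that step in Rust, the following is a minimal sketch of the same transformation. It is not part of this commit or of prot-scriber; the function name `mcl_clusters_to_families` is hypothetical, and it assumes the `mcl --abc` output holds one cluster per line with TAB-separated members.

```rust
/// Hypothetical helper (not part of prot-scriber): mirrors the sed/awk
/// one-liner from step (iii) of the manual.
///
/// Input: `mcl --abc` output, assumed to be one cluster per line with
/// members separated by TAB characters.
/// Output: one line per non-singleton cluster in the form
/// "Seq-Fam_<n>\t<gene1>,<gene2>,...", numbered from 1 in input order.
pub fn mcl_clusters_to_families(mcl_output: &str) -> Vec<String> {
    let mut families: Vec<String> = Vec::new();
    for line in mcl_output.lines() {
        // Like `awk -F ","` after `sed 's/\t/,/g'`: every TAB starts a new field.
        let genes: Vec<&str> = line.split('\t').collect();
        // Skip singleton clusters (the awk condition `NF > 1`).
        if genes.len() > 1 {
            // The counter only advances for clusters that are kept.
            let id = families.len() + 1;
            families.push(format!("Seq-Fam_{}\t{}", id, genes.join(",")));
        }
    }
    families
}
```

Writing the returned lines to 'all_proteins_gene_families.txt', one per line, should give a file equivalent to the one the shell pipeline produces, which is then passed to prot-scriber with '-f'.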
