-
Notifications
You must be signed in to change notification settings - Fork 15
Searching for genes by homology with other genes
Recall that you can get the ITEP gene ID for a gene of interest with specific function or locus tag by using the db_getGenesWithAnnotation.py function:
$ db_getGenesWithAnnotation.py "Cbei_1843"
fig|290402.1.peg.1824 1-phosphofructokinase_YP_001308970.1_Cbei_1843
Many ITEP scripts are designed to take output such as this (which is by default printed to stdout) and use them as inputs to another script using pipes (|). One such script is the script db_getBlastResultsContainingGenes.py function, which identifies all genes homologous to one or more input genes. We connect the results of the previous script with this new one by using the following syntax:
$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1
Which produces an output like this (only one line shown here)
fig|290402.1.peg.1824 fig|290402.1.peg.1824 100.0 300 0 0 1 300 1 300 5e-173 600.0 600.0 600.0
This is the standard (-m9) tab-delimited output from BLAST with some additional information added to the end. The table consists of in order: query gene, target gene, percent ID, length of alignment, number of mismatches, number of gap openings, query start, query end, target start, target end, E-value, bitscore, query self-bitscore and target self-bitscore.
The script by default gives you BLASTP results - you can also get BLASTN results with identical format by using the -n flag:
$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 -n
fig|290402.1.peg.1824 fig|290402.1.peg.1824 100.0 903 0 0 1 903 1 903 0.0 1629.0 1629.0 1629.0
You can specify an E-value cutoff for results to display with the -c flag.
Finally, if the input file has the gene IDs you want to query against in a different column, change the value of -g to reflect that. The -g argument is optional if the gene ID is in the first column as described in the help text (Use the -h flag with any python script to get help text).
Suppose you have a set of genes and want to identify whether or not they are strongly homologous. For example, Clostridium beijerinckii has three genes annotated as 6-phosphofructokinase, as we can see by searching for this annotation:
$ db_getGenesWithAnnotation.py "6-phosphofructokinase" | grep "Cbei"
fig|290402.1.peg.581 6-phosphofructokinase_YP_001307727.1_Cbei_0584 fig|290402.1.peg.992 6-phosphofructokinase_YP_001308138.1_Cbei_0998 fig|290402.1.peg.4768 6-phosphofructokinase_YP_001311914.1_Cbei_4852
You can tell whether these genes are actually homologous (and how strongly so they are) using the db_getBlastResultsBetweenSpecificGenes.py function (note the difference between this and the db_getBlastResultsContainingGenes.py function). Use pipes to link the output from the above function with this one:
$ db_getGenesWithAnnotation.py "6-phosphofructokinase" | grep "Cbei" | db_getBlastResultsBetweenSpecificGenes.py
Two lines in the output are
fig|290402.1.peg.581 fig|290402.1.peg.4768 27.55 294 167 12 1 279 1 263 9e-13 68.6 833.0 637.0
fig|290402.1.peg.992 fig|290402.1.peg.4768 37.97 345 178 5 5 349 1 309 4e-60 225.0 723.0 637.0
Suggesting that the two copies fig|290402.1.peg.581 and fig|290402.1.peg.4768 are not very strongly homologous, while fig|290402.1.peg.992 and fig|290402.1.peg.4768 are more closely related.
We can combine the two analyses above to first obtain a list of homologous proteins to a given protein, and then find all of the homologies between them to make a cohesive set. Visualizing this data (see "Visualizing homology evidence") with specific homology scoring metrics is a useful way to see how the proteins are likely to cluster without actually performing the clustering.
To make a cohesive set of BLASTP results between all proteins that are similar to Cbei_1843 within an E-value of 1E-20, the following commands are sufficient:
$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 -c 1E-20 | db_getBlastResultsBetweenSpecificGenes.py -g 1