Skip to content
This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Searching for genes by homology with other genes

mattb112885 edited this page Apr 17, 2013 · 8 revisions

Finding genes homologous to a specific gene

Recall that you can get the ITEP gene ID for a gene of interest with specific function or locus tag by using the db_getGenesWithAnnotation.py function:

$ db_getGenesWithAnnotation.py "Cbei_1843"

fig|290402.1.peg.1824 1-phosphofructokinase_YP_001308970.1_Cbei_1843

Many ITEP scripts are designed to take output such as this (which is by default printed to stdout) and use them as inputs to another script using pipes (|). One such script is the script db_getBlastResultsContainingGenes.py function, which identifies all genes homologous to one or more input genes. We connect the results of the previous script with this new one by using the following syntax:

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1

Which produces an output like this (only one line shown here)

fig|290402.1.peg.1824 fig|290402.1.peg.1824 100.0 300 0 0 1 300 1 300 5e-173 600.0 600.0 600.0

This is the standard (-m9) tab-delimited output from BLAST with some additional information added to the end. The table consists of in order: query gene, target gene, percent ID, length of alignment, number of mismatches, number of gap openings, query start, query end, target start, target end, E-value, bitscore, query self-bitscore and target self-bitscore.

The script by default gives you BLASTP results - you can also get BLASTN results with identical format by using the -n flag:

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 -n

fig|290402.1.peg.1824 fig|290402.1.peg.1824 100.0 903 0 0 1 903 1 903 0.0 1629.0 1629.0 1629.0

You can specify an E-value cutoff for results to display with the -c flag.

Finally, if the input file has the gene IDs you want to query against in a different column, change the value of -g to reflect that. The -g argument is optional if the gene ID is in the first column as described in the help text (Use the -h flag with any python script to get help text).

Finding homology between a set of genes

Suppose you have a set of genes and want to identify whether or not they are strongly homologous. For example, Clostridium beijerinckii has three genes annotated as 6-phosphofructokinase, as we can see by searching for this annotation:

$ db_getGenesWithAnnotation.py "6-phosphofructokinase" | grep "Cbei"

fig|290402.1.peg.581 6-phosphofructokinase_YP_001307727.1_Cbei_0584 fig|290402.1.peg.992 6-phosphofructokinase_YP_001308138.1_Cbei_0998 fig|290402.1.peg.4768 6-phosphofructokinase_YP_001311914.1_Cbei_4852

You can tell whether these genes are actually homologous (and how strongly so they are) using the db_getBlastResultsBetweenSpecificGenes.py function (note the difference between this and the db_getBlastResultsContainingGenes.py function). Use pipes to link the output from the above function with this one:

$ db_getGenesWithAnnotation.py "6-phosphofructokinase" | grep "Cbei" | db_getBlastResultsBetweenSpecificGenes.py

One line in the output is

fig|290402.1.peg.581 fig|290402.1.peg.4768 27.55 294 167 12 1 279 1 263 9e-13 68.6 833.0 637.0

fig|290402.1.peg.992 fig|290402.1.peg.4768 37.97 345 178 5 5 349 1 309 4e-60 225.0 723.0 637.0

Suggesting that the two copies fig|290402.1.peg.581 and fig|290402.1.peg.4768 are not very strongly homologous, while fig|290402.1.peg.992 and fig|290402.1.peg.4768 are more closely related.

Building a homologous set from a seed protein

We can combine the two analyses above to first obtain a list of homologous proteins to a given protein, and then find all of the homologies between them to make a cohesive set. Visualizing this data (see "Visualizing homology evidence") with specific homology scoring metrics is a useful way to see how the proteins are likely to cluster without actually performing the clustering.

To make a cohesive set of BLASTP results between all proteins that are similar to Cbei_1843 within an E-value of 1E-20, the following commands are sufficient:

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 -c 1E-20 | db_getBlastResultsBetweenSpecificGenes.py -g 1

Clone this wiki locally