-
Notifications
You must be signed in to change notification settings - Fork 15
Searching for gene families by presence and absence patterns
The database stores a pre-computed table of which genes are present and absent in each organism for each of the clusters in the database. To access this table (or subsets of it) use the db_getPresenceAbsenceTable.py function. Just calling the function without any arguments prints out the entire table (with every cluster run and every cluster). Only the first two rows are shown in the example below:
$ db_getPresenceAbsenceTable.py
runid clusterid annote Clostridium_beijerinckii_NCIMB_8052 Clostridium_novyi_NT Acetobacterium_woodii_DSM_1030
all_I_2.0_c_0.4_m_maxbit 1424 hypothetical protein_YP_001311274.1_Cbei_4208 fig|290402.1.peg.4128 NONE fig|931626.1.peg.2672
As you can see, by default, the function prints out the ITEP IDs in each organism for each cluster - to instead get Boolean values (0 for absent and 1 for present) just call the function with the -b flag:
$ db_getPresenceAbsenceTable.py -b
runid clusterid annote Clostridium_beijerinckii_NCIMB_8052 Clostridium_novyi_NT Acetobacterium_woodii_DSM_1030
all_I_2.0_c_0.4_m_maxbit 1424 hypothetical protein_YP_001311274.1_Cbei_4208 1 0 1
The function also lets you print out the number of gene representatives in each cluster\run pair for each organism, or pull out subsets by run ID or by cluster ID (see help text for details). Finally, if you have an organism tree with sanitized organism names as leaf names (see Building a concatinated gene tree for how to build such a tree), you can tell the program to sort the columns in a way that makes sense in light of the tree's branching order (so that for example organisms of the same species will group together) using the -t flag.
The script db_findClustersByOrganismList.py supports finding clusters with different combinations of four bulk properties, ALL, ANY, ONLY, and NONE, with respect to a given list of organisms (the "ingroup") and the rest of the organisms in a specific cluster run (the "outgroup"). The below table specifies the possible combinations and what they represent.
Property | Ingroup | Outgroup | Meaning
+-----------+---------+-----------+----------
ALL | == N | >= 0 | Conserved genes in the specified list
+-----------+---------+-----------+----------
ANY | >= 1 | >= 0 | Genes present in the specified list
+-----------+---------+-----------+----------
ONLY | >= 1 | == 0 | Genes unique to the specified list
+-----------+---------+-----------+----------
NONE | == 0 | >= 1* | Genes absent from the specified list
+-----------+---------+-----------+-----------
ALL + ONLY | == N | == 0 | Genes that are conserved and found only in the specified list
+-----------+---------+-----------+-----------
ANY + ONLY | >= 1 | == 0 | Genes that are found only in the specified list
+-----------+---------+-----------+-----------
ALL + NONE |
ANY + NONE | Contradictions (raise errors).
ONLY + NONE|
+-----------+---------+-----------
You can also specify a UNIQUE property which enforces that the members in the ingroup must be unique.
Create a file with the two lines in it:
Clostridium beijerinckii NCIMB 8052
Clostridium novyi NT
Save it as "Clostridia_names.txt". Then run the following:
$ cat "Clostridia_names.txt" | db_findClustersByOrganismList.py -a -s all_I_2.0_c_0.4_m_maxbit
You get a list of all of the genes clusters that are found in both of our Clostridia species in our test database (-a) but not in Acetobacterium woodii (-s). By doing a line count (wc -l) we find that 568 families have this property including the following pair:
all_I_2.0_c_0.4_m_maxbit 996
We can find out what genes are in this cluster by using the db_getGenesInClusters.py function, or to find all of their functions and sequences directly use the db_getClusterGeneInformation.py function:
$ makeTabDelimitedRow.py "all_I_2.0_c_0.4_m_maxbit" "996" | db_getClusterGeneInformation.py
fig|290402.1.peg.3932 Clostridium beijerinckii NCIMB 8052 290402.1 290402.1 290402.1.NC_009617.1 4607337 4605538 - -1 methyl-accepting chemotaxis sensory transducer_YP_001311078.1_Cbei_4012 ATGAAAG... MKGTVVS... all_I_2.0_c_0.4_m_maxbit 996
fig|386415.1.peg.124 Clostridium novyi NT 386415.1 386415.1 386415.1.NC_008593.1 120602 122401 + 1 methyl-accepting chemotaxis protein_YP_877097.1_NT01CX_1000 ATGAAAG.... MKGTIVAT... all_I_2.0_c_0.4_m_maxbit 996
Where the DNA and amino acid sequences have been truncated for clarity. Note that as expected both C. beijerinckii and C. novyi are predicted to have this gene, but Acetobacterium is not. We could perform clustering at lower cutoffs (0.4 is rather stringent) and perform similar queries to see if this prediction holds up.
This section requires you to have a tree with sanitized organism names on the leaves. A sanitized ID has everything that is not a letter or number in the organism's name (including spaces) replaced with an underscore. See the Building a concatinated gene tree tutorial for more details on how to generate such a file.
We have provided a function "makeCoreClusterAnalysisTree.py" that performs presence-absence analysis for every clade in a tree using the same methods as shown above. It produces the number of gene families with the specified properties and displays them on a tree. Optionally, it also produces an Excel file containing the runID\cluster ID pairs and a representative annotation (the one with the highest number of genes) for each cluster identified at each node in the tree (the names of the sheets correspond to the node labels on the tree), allowing quick identification of interesting gain and loss patterns at different stages of evolution.