Turning ITEP IDs into human-readable formats
An ITEP gene ID consists of three parts:
- fig| - This is somewhat of an artifact of the way the toolkit was developed (which was based on RAST)
- An organism ID that matches the regex "\d+\.\d+"
- An identifier for the gene itself in the form "peg\.\d+"
An example of an ITEP gene ID is fig|290402.1.peg.581: 290402.1 is the organism ID and 581 is the gene number. This format is computationally convenient because the organism associated with each gene can be read directly from the ID. However, if you have a file containing ITEP gene IDs, it is often helpful to replace them with the gene IDs present in the Genbank files (particularly locus tags), with their annotations, or with the name of the organism rather than its ID. ITEP includes several scripts for this, which are described briefly below.
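The three parts of a gene ID can be pulled apart with a short regular expression. The following is a minimal sketch, not part of ITEP: the function name split_itep_id is hypothetical, and the pattern is built only from the format described above.

```python
import re

# Pattern assembled from the three parts described above. The dots are
# escaped so that they match a literal "." rather than any character.
ITEP_ID = re.compile(r"fig\|(\d+\.\d+)\.peg\.(\d+)")


def split_itep_id(gene_id):
    """Return (organism_id, gene_number), or None if gene_id is not an ITEP ID.

    Hypothetical helper for illustration; it is not an ITEP script.
    """
    match = ITEP_ID.fullmatch(gene_id)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(split_itep_id("fig|290402.1.peg.581"))  # ('290402.1', 581)
```

Strings that do not follow the format, such as a bare locus tag, return None instead of raising an error, which makes the helper easy to use when scanning mixed input.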
If you have a tab-delimited file (as produced by many of the ITEP scripts) with gene IDs in it, you can add a new column containing organism names and (optionally) annotations to that table by running db_addOrganismNameToTable.py. For example:
$ db_getGenesWithAnnotation.py "6-phosphofructokinase"
fig|290402.1.peg.581 6-phosphofructokinase_YP_001307727.1_Cbei_0584
fig|290402.1.peg.992 6-phosphofructokinase_YP_001308138.1_Cbei_0998
fig|290402.1.peg.4768 6-phosphofructokinase_YP_001311914.1_Cbei_4852
fig|386415.1.peg.406 6-phosphofructokinase_YP_877380.1_NT01CX_1297
fig|931626.1.peg.1249 6-phosphofructokinase_YP_005268952.1_Awo_c12790_pfkA
$ db_getGenesWithAnnotation.py "6-phosphofructokinase" | db_addOrganismNameToTable.py
fig|290402.1.peg.581 6-phosphofructokinase_YP_001307727.1_Cbei_0584 Clostridium beijerinckii NCIMB 8052
fig|290402.1.peg.992 6-phosphofructokinase_YP_001308138.1_Cbei_0998 Clostridium beijerinckii NCIMB 8052
fig|290402.1.peg.4768 6-phosphofructokinase_YP_001311914.1_Cbei_4852 Clostridium beijerinckii NCIMB 8052
fig|386415.1.peg.406 6-phosphofructokinase_YP_877380.1_NT01CX_1297 Clostridium novyi NT
fig|931626.1.peg.1249 6-phosphofructokinase_YP_005268952.1_Awo_c12790_pfkA Acetobacterium woodii DSM 1030
ITEP IDs can be replaced with a particular alias by preparing a translation table and calling replaceGeneNamesWithAliases.py. A "master" translation table is generated automatically when you build the database and is located at $root/aliases/aliases, where $root is the root directory of your ITEP repository. A single gene ID can be associated with multiple aliases, so before running replaceGeneNamesWithAliases.py you should reduce the table to a single alias per gene. The easiest way to do this is with a regex; for example, all of the locus tags for C. beijerinckii match "Cbei_\d+", so to pull out all of the locus tag associations you would run:
$ cat $root/aliases/aliases | grep -P "Cbei_\d+" >> translation_table
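If a regex still leaves more than one alias for some genes, the reduction to a single alias per gene can also be done in a few lines of Python. This is only a sketch under a stated assumption: it assumes each line of the aliases file is an ITEP gene ID, a tab, and one alias, which you should check against your own file. The function name and the sample lines are made up for illustration, reusing IDs from the examples above.

```python
import re


def build_translation_table(alias_lines, alias_pattern):
    """Keep the first alias per gene ID that fully matches alias_pattern.

    ASSUMPTION: each line looks like "<ITEP gene ID><TAB><alias>".
    Verify this against $root/aliases/aliases before relying on it.
    """
    pattern = re.compile(alias_pattern)
    chosen = {}
    for line in alias_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene_id, alias = fields[0], fields[1]
        if pattern.fullmatch(alias) and gene_id not in chosen:
            chosen[gene_id] = alias
    return chosen


# Illustrative sample lines only, not real contents of the aliases file.
example = [
    "fig|290402.1.peg.581\tYP_001307727.1\n",
    "fig|290402.1.peg.581\tCbei_0584\n",
    "fig|290402.1.peg.992\tCbei_0998\n",
]
table = build_translation_table(example, r"Cbei_\d+")
for gene_id, alias in table.items():
    print(gene_id + "\t" + alias)
```

Unlike the grep command, which keeps any line containing a match, this sketch requires the whole alias to match the pattern and keeps at most one alias per gene ID.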
Once you have a translation table you like, take the file ("itep_id_file") with ITEP IDs in it and run:
$ cat itep_id_file | replaceGeneNamesWithAliases.py translation_table
See Obtaining a list of bidirectional best blast hits for an example of this in action.
The aliases file can also be used to find the ITEP ID for a given alias by calling replaceAliasesWithGeneNames.py (see Generating draft metabolic reconstructions from a reference for an example).
In some applications (such as generating organism phylogenies) you end up with organism IDs (e.g. 290402.1) in a file. To make these more readable in outside programs, replace the organism IDs with the organisms' names using replaceOrgWithAbbrev.py.
Note - This script will ultimately be renamed and we will update this tutorial with the new name.
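To illustrate the idea of replacing organism IDs, here is a minimal sketch that substitutes IDs in a Newick-style tree string. It does not reproduce the behavior of replaceOrgWithAbbrev.py: the function name, the hand-written ID-to-name dictionary, the use of underscores in the names, and the tree string with its branch lengths are all assumptions made for illustration. Only known IDs are replaced, because a generic pattern like "\d+\.\d+" would also match branch lengths.

```python
import re


def replace_organism_ids(text, names):
    """Replace each organism ID found in names with its organism name.

    Hypothetical helper; names maps organism ID -> display name.
    """
    if not names:
        return text
    # Longest IDs first, so "290402.10" is never treated as "290402.1".
    ids = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(org_id) for org_id in ids)
    # The lookarounds stop matches inside longer numbers or identifiers.
    pattern = re.compile(r"(?<![\w.])(" + alternatives + r")(?!\w)")
    return pattern.sub(lambda m: names[m.group(1)], text)


# Illustrative mapping and tree; the branch lengths are invented.
names = {
    "290402.1": "Clostridium_beijerinckii_NCIMB_8052",
    "386415.1": "Clostridium_novyi_NT",
}
tree = "(290402.1:0.12,386415.1:0.30);"
print(replace_organism_ids(tree, names))
# (Clostridium_beijerinckii_NCIMB_8052:0.12,Clostridium_novyi_NT:0.30);
```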