This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Obtaining information about genes

mattb112885 edited this page Nov 10, 2013 · 6 revisions

This page shows how to retrieve information about genes, such as their location and their nucleotide and protein sequences. You can look genes up by their ITEP ID or by their annotation. The page also describes a convenient interface for retrieving this information for every gene in a cluster (or in a list of clusters).

Getting gene info from a list of gene IDs

Information about genes with a specific ITEP ID can be obtained in tab-delimited format ("geneinfo" tables) using the command db_getGeneInformation.py.

$ echo 'fig|290402.1.peg.1824' | db_getGeneInformation.py
fig|290402.1.peg.1824   Clostridium beijerinckii NCIMB 8052     290402.1        DEFAULT_1       290402.1.NC_009617.1    2133712 2134614 +       1       1-phosphofructokinase_YP_001308970.1_Cbei_1843  ATGATTAATACAATAAC...  MINTITLNPSLDYIVKVDSF...

The columns in this table are: gene ID, organism name, organism ID, organism abbreviation (OBSOLETE), contig ID, start NT, stop NT, strand, strandnum, annotation, nucleotide sequence, and amino acid sequence. The start NT is the position of the first nucleotide of the gene's start codon, and the stop NT is the position of the last nucleotide of its stop codon; positions are 1-based, so the first nucleotide in the contig is 1. By this convention, start < stop for + strand genes and start > stop for - strand genes. Strand is either + or -, and strandnum is the sign of the strand (1 for the + strand, -1 for the - strand).
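If you want to work with geneinfo rows in your own scripts, the column layout described above can be sketched in Python as follows. This is only an illustration and not part of ITEP: the field names, parse_geneinfo_line, gene_length and follows_strand_convention are made-up helper names, and the sequences in the sample row are truncated exactly as they are shown on this page.

```python
from collections import namedtuple

# Field names mirror the column list above; the identifiers are illustrative only.
GeneInfo = namedtuple("GeneInfo", [
    "gene_id", "organism", "organism_id", "organism_abbrev", "contig_id",
    "start", "stop", "strand", "strandnum", "annotation", "nt_seq", "aa_seq",
])

def parse_geneinfo_line(line):
    """Parse one tab-delimited geneinfo row; any columns after the 12th are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-delimited columns, got %d" % len(fields))
    rec = GeneInfo(*fields[:12])
    # Convert the numeric columns from text to integers.
    return rec._replace(start=int(rec.start), stop=int(rec.stop),
                        strandnum=int(rec.strandnum))

def gene_length(rec):
    """Length in nucleotides; start and stop are both inclusive, 1-based positions."""
    return abs(rec.stop - rec.start) + 1

def follows_strand_convention(rec):
    """Check start < stop on the + strand and start > stop on the - strand."""
    if rec.strand == "+":
        return rec.strandnum == 1 and rec.start < rec.stop
    if rec.strand == "-":
        return rec.strandnum == -1 and rec.start > rec.stop
    return False

# Sample row taken from the example output above (sequences truncated as shown there).
example = "\t".join([
    "fig|290402.1.peg.1824", "Clostridium beijerinckii NCIMB 8052", "290402.1",
    "DEFAULT_1", "290402.1.NC_009617.1", "2133712", "2134614", "+", "1",
    "1-phosphofructokinase_YP_001308970.1_Cbei_1843",
    "ATGATTAATACAATAAC...", "MINTITLNPSLDYIVKVDSF...",
])

rec = parse_geneinfo_line(example)
# Prints: fig|290402.1.peg.1824 290402.1.NC_009617.1 903 True
print(rec.gene_id, rec.contig_id, gene_length(rec), follows_strand_convention(rec))
```

The length is computed as abs(stop - start) + 1 because both coordinates are inclusive, so the same formula works for genes on either strand.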

You can chain this command with search commands such as db_getGenesWithAnnotation.py, which let you find the ITEP ID of a gene of interest. For example, here is how to get the geneinfo for the gene with locus tag "Cbei_1843", which is the same gene as in the example above.

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getGeneInformation.py -g 1
fig|290402.1.peg.1824   Clostridium beijerinckii NCIMB 8052     290402.1        DEFAULT_1       290402.1.NC_009617.1    2133712 2134614 +       1       1-phosphofructokinase_YP_001308970.1_Cbei_1843  ATGATTAATACAATAAC...  MINTITLNPSLDYIVKVDSF...

The -g 1 option is necessary because the gene ID is in the first column of the output of db_getGenesWithAnnotation.py.
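To make the idea of a 1-based column number concrete, here is a minimal Python sketch of picking the gene ID out of a chosen column of tab-delimited input. It is not how ITEP implements -g; select_column is a hypothetical helper, and every field in the sample rows other than the gene ID is a made-up placeholder.

```python
def select_column(lines, column):
    """Yield the value found in the given 1-based column of each tab-delimited line."""
    if column < 1:
        raise ValueError("column numbers start at 1")
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < column:
            raise ValueError("line has only %d columns: %r" % (len(fields), line))
        yield fields[column - 1]

# Gene ID in the first column (the situation described above).
rows_id_first = ["fig|290402.1.peg.1824\tplaceholder_field\n"]
# Gene ID in the second column (a hypothetical upstream table).
rows_id_second = ["placeholder_field\tfig|290402.1.peg.1824\n"]

ids_from_first = list(select_column(rows_id_first, 1))
ids_from_second = list(select_column(rows_id_second, 2))
print(ids_from_first == ids_from_second)
```

Both calls recover the same gene ID; only the column number passed in changes with the layout of the upstream table.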

Getting gene information for all genes in a cluster

If you have a cluster ID (and corresponding run ID) that you know is interesting (such as one that contains a specific gene of interest), you can get the geneinfo table for all genes in that cluster using the db_getClusterGeneInformation.py command. For example, this set of commands gets all of the cluster/run ID pairs containing the gene with locus tag "Cbei_1843", extracts the one for run ID "all_I_2.0_c_0.4_m_maxbit" and then gets the geneinfo for all genes in that cluster.

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getClustersContainingGenes.py -g 1 | grep "all_I_2.0_c_0.4_m_maxbit" | db_getClusterGeneInformation.py

The format of the output table is identical to that of db_getGeneInformation.py, except that the cluster/run ID pair is output in the final two columns of the table (so that you can tell the results apart if you pipe in multiple cluster/run ID pairs).
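When several cluster/run ID pairs are piped in at once, the final two columns can be used to split the combined table back into one group per pair. The Python sketch below shows one way to do that, assuming only what is stated above: 12 geneinfo columns followed by two more columns identifying the pair. It treats those two columns as an opaque key and does not assume which one holds the run ID. group_by_cluster_pair is a hypothetical helper, and all values in the sample rows are placeholders rather than real ITEP output.

```python
from collections import defaultdict

def group_by_cluster_pair(lines):
    """Map the (second-to-last, last) column pair of each row to its list of gene IDs."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 14:
            raise ValueError("expected 12 geneinfo columns plus 2 more, got %d" % len(fields))
        key = tuple(fields[-2:])        # opaque cluster/run ID pair
        groups[key].append(fields[0])   # gene ID is the first column
    return dict(groups)

def make_row(gene_id, pair):
    """Build a placeholder row: gene ID, 11 dummy geneinfo columns, then the pair."""
    return "\t".join([gene_id] + ["."] * 11 + list(pair)) + "\n"

# Placeholder data: two pairs, three genes.
table = [
    make_row("gene_a", ("pairX_first", "pairX_second")),
    make_row("gene_b", ("pairX_first", "pairX_second")),
    make_row("gene_c", ("pairY_first", "pairY_second")),
]

groups = group_by_cluster_pair(table)
for pair, gene_ids in sorted(groups.items()):
    print(pair, gene_ids)
```

In a real pipeline you would pass sys.stdin to group_by_cluster_pair instead of the placeholder table.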
