Skip to content
This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Extracting dna and amino acid sequences from a region of a genome, gene or protein

mattb112885 edited this page May 9, 2013 · 8 revisions

DNA and amino acid sequences can be extracted from specific genes (or proteins) or contigs using the db_getSequencesFromBlastResults.py command. The name of the function comes from the fact that most of the time, the function will be used from BLAST results, but it can be used with any table as input that contains, in three separate columns:

  1. The name of a gene or a contig from which to get the sequence
  2. The start location of the gene or contig
  3. The stop location of the gene or contig.

Obviously the way that you call the function should be different depending on whether you want amino acid or nucleotide sequences, and whether you are starting with a gene (or protein) ID or a contig ID. Here's some common use cases:

Getting sequences from BLASTP results

The BLASTP results table contains locations of the HSP within the AMINO ACID sequence for each gene. Therefore you must call the db_getSequencesFromBlastResults.py function with the -p flag.

The column number options (-i for ID, -s for start location column and -e for end location column) are set up by default to get you the sequences of the target in a BLASTP results table.

As an example, going back to the example we discussed in an earlier tutorial, a search for genes homologous to "Cbei_1843" (a phosphofructokinase) gave many results; here is one of them.

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 > Cbei_1843_blastp_res
fig|290402.1.peg.3996    fig|290402.1.peg.1824    25.66    304    194    13    11    297    10    298    1e-10    58.9    600.0    600.0

Note that the db_getBlastResultsContainingGenes.py function returns BLASTP results by default.

Now we get the sequence of the target gene (which in this case is fig|290402.1.peg.1824) using

$ cat Cbei_1843_blastp_res | db_getSequencesFromBlastResults.py -p

The result is the addition of an amino acid sequence to each line in the BLASTP results. The sequence appended to the above BLAST hit is

SLDYIVKVDSFKVD ....

This sequence is amino acids number 10 to 298 (columns 9 and 10 above) in the protein sequence for fig|290402.1.peg.1824.

To get the sequence of the HSP for the query we need to specify the columns: column 1 for the query gene ID, column 7 for the query gene start location and column 8 for the query gene stop location.

$ cat Cbei_1843_blastp_res | db_getSequencesFromBlastResults.py -p -i 1 -s 7 -e 8

This would append the amino acid sequence for amino acid positions 11 to 297 within the protein fig|290402.1.peg.3996.

Getting sequences from BLASTN results

BLASTN takes a gene sequence (nucleotide) as both query and target and returns as part of its output the nucleotide positions within the query and target that were found to be homologous.

First lets get some BLASTN results by using the -n flag on db_getBlastResultsContainingGenes.py

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 -n > Cbei_1843_blastn_res

To get the correct sequences from BLASTN results you must use the -n flag which specifies that the positions are nucleotide positions within a gene:

$ cat Cbei_1843_blastn_res | db_getSequencesFromBlastResults.py -n

Just as in BLASTP, this result by default gives you the nucleotide sequences within the target gene. To get the nucleotide sequences within the query gene use this command:

$ cat Cbei_1843_blastn_res | db_getSequencesFromBlastResults.py -n -i 1 -s 7 -e 8
Clone this wiki locally