How to import genomes and format them for use with ITEP
There are two steps to data setup: obtaining a Genbank file from your desired source, and running convertGenbank2table.py to handle the pre-processing necessary for import into ITEP. This tutorial covers several ways to get Genbank files and how to use that script.
- Obtain genbank files for your organism. You can download these from ftp://ftp.ncbi.nih.gov/genomes/ and its subfolders.
- Concatenate the Genbank files if there is more than one contig (ITEP needs a single “multi-Genbank” file)
- Put the genbank file in any folder EXCEPT $root/genbank.
- Follow the directions below for using convertGenbank2table.py for data import. Use any number you want for the version number; I recommend something like 88888 so you can easily tell which of your genomes came from NCBI. Ultimately it doesn't matter, as long as different organisms with the same TaxID get different version numbers.
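The concatenation step above can be sketched in Python as follows. This is a minimal illustration, not part of ITEP: the function name `concatenate_genbank` and the file names in the usage note are hypothetical, and it assumes each input file holds complete Genbank records ending in a `//` line.

```python
from pathlib import Path


def concatenate_genbank(contig_files, output_file):
    """Join several Genbank files into one multi-Genbank file.

    contig_files: paths of the per-contig Genbank files, in the desired order.
    output_file: path of the combined file (keep it outside $root/genbank).
    """
    with open(output_file, "w") as out:
        for path in contig_files:
            text = Path(path).read_text()
            # Make sure every file ends with a newline so the "//" record
            # terminator of one file is not glued to the next LOCUS line.
            if not text.endswith("\n"):
                text += "\n"
            out.write(text)
    return output_file
```

For example, `concatenate_genbank(["chromosome.gbk", "plasmid1.gbk"], "my_organism.gbk")` produces the same result as using `cat` on the two files when both already end with a newline.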
- Identify the SEED ID for your organism by searching for it on the SEED website (pubseed.theseed.org)
- Navigate to the following URL: ftp://ftp.theseed.org/genomes/genbank/
- Download the genbank file and place it in any folder EXCEPT $root/genbank
- Follow the directions below for using convertGenbank2table.py for data import.
Note that the resulting ITEP gene IDs will be the same as the PubSEED gene IDs as long as you use the second number in the PubSEED genome ID as the version number when running convertGenbank2table.py. For example, for E. coli (genome ID 83333.1), you would download the PubSEED Genbank file and specify "1" as the version number.
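As a small illustration of the rule above, the following sketch splits a PubSEED genome ID into the TaxID and the version number to pass with `-v`. The helper name `split_seed_genome_id` is hypothetical and is not part of ITEP.

```python
def split_seed_genome_id(genome_id):
    """Split a PubSEED genome ID such as '83333.1' into (taxid, version).

    The second element is the value to use as the version number when
    importing the PubSEED Genbank file.
    """
    parts = genome_id.strip().split(".")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"Not a genome ID of the form TAXID.VERSION: {genome_id!r}")
    taxid, version = parts
    return taxid, version
```

For the E. coli example, `split_seed_genome_id("83333.1")` gives the TaxID `"83333"` and the version `"1"`.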
RAST ( http://rast.nmpdr.org/ ) is a tool for calling and annotating genes in bacterial and archaeal genomes. If you want to analyze your own genomes and do not yet have gene calls and annotations, RAST is probably the most convenient way to get them. After your genomes are done being annotated you have two options for how to integrate the data into ITEP.
IMPORTANT. If you choose to use this method for import, the IDs in ITEP will NOT necessarily match the IDs given in RAST because RAST gene IDs don't appear in the genbank file. If you need them to match you will need to use the web interface (next section).
Every RAST job is given a "jobID". We have included a script that uses the MyRAST API to download results files from RAST (using it requires that you have installed MyRAST). The one you will need is the Genbank file.
- Locate the Job ID for each genome you want to download from RAST.
- Create a "jobs file" with two columns. The first column contains the job ID and the second one contains the organism's name (used for naming the files).
- Use the get_RAST_jobs.pl script (located in scripts/) as follows to get the Genbank file:
$ get_RAST_jobs.pl -u (your RAST username) -p (your RAST password) -j (jobs file) -f genbank
Put the Genbank files anywhere EXCEPT $root/genbank and follow the directions below for using convertGenbank2table.py for data import.
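The two-column jobs file described above could be written with a short script like the one below. This is only a sketch: the column separator is not specified here, so a tab is assumed (check the usage message of get_RAST_jobs.pl before relying on it). The function name `write_jobs_file` and the job IDs in the usage note are made up for illustration.

```python
def write_jobs_file(jobs, path):
    """Write a two-column jobs file: job ID, then organism name.

    jobs: iterable of (job_id, organism_name) pairs.
    ASSUMPTION: columns are separated by a single tab.
    """
    with open(path, "w") as handle:
        for job_id, name in jobs:
            if "\t" in str(name) or "\n" in str(name):
                raise ValueError(f"Organism name must not contain tabs or newlines: {name!r}")
            handle.write(f"{job_id}\t{name}\n")
    return path
```

For example, `write_jobs_file([(11111, "Organism_A"), (22222, "Organism_B")], "jobs.txt")` writes one line per RAST job, and the resulting file is what you pass with `-j`.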
- Use the web interface to access your genome.
- Download the Genbank file and the 'Spreadsheet (tab delimited text format)'.
- Save the Genbank file to $ITEP_ROOT/genbank/ and the tab-delimited file to $ITEP_ROOT/raw/ with the same names as given by RAST.
Using this approach, the ITEP gene IDs will match what is in RAST. You can later add the ITEP (RAST) IDs back to the Genbank file using the command "addItepIdsToGenbank.py", if you need them for downstream analysis.
If you use this approach, you do not need to use convertGenbank2table.py to process the Genbank file because you already have a "raw" file available in the correct format.
Note: You do not need to do this step if you used the web interface for RAST to download the tab delimited file and Genbank file.
Genbank files from any of the above sources will work with the following directions. User-supplied Genbank files should also work as long as they contain a TaxID, DNA sequences, an organism name and called genes in the appropriate places in the file.
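Before importing a user-supplied Genbank file, a rough text scan can show whether the pieces listed above are present. The sketch below is not how convertGenbank2table.py reads the file; it only looks for the usual Genbank flat-file markers (`/db_xref="taxon:..."`, `ORGANISM`, `ORIGIN`, and `CDS` features). Treating `CDS` features as the "called genes" is an assumption, and the function name `check_genbank_text` is hypothetical.

```python
import re


def check_genbank_text(text):
    """Report which required pieces appear in the text of a Genbank file.

    Returns a dict of booleans. This is a plain-text heuristic, not a parser.
    """
    return {
        # TaxID is normally stored on the "source" feature.
        "taxid": re.search(r'/db_xref="taxon:\d+"', text) is not None,
        # Organism name line in the record header.
        "organism": re.search(r"^\s+ORGANISM\s+\S", text, re.M) is not None,
        # The DNA sequence follows a line starting with ORIGIN.
        "sequence": re.search(r"^ORIGIN", text, re.M) is not None,
        # ASSUMPTION: called genes show up as CDS features.
        "genes": re.search(r"^\s+CDS\s+\S", text, re.M) is not None,
    }
```

Reading a file with `open(path).read()` and passing the text to `check_genbank_text` returns a dictionary in which any `False` value points at a piece that is probably missing.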
The command to use is the following (note that convertGenbank2table.py is located in $root):
$ ./convertGenbank2table.py -g [genbank_file] -v [versionnum]
This command will automatically copy your Genbank file to $root/genbank/[taxid].[versionnum].gbk, create a tab-delimited file in ITEP's required format in $root/raw/[taxid].[versionnum].txt, and add protein IDs, gene names, and locus tags from the Genbank file to an aliases file located at $root/aliases/aliases. If you later have aliases to add that aren’t in the Genbank file, you can add them manually to that file.
Run this command for each of your genbank files. If it complains about duplicated IDs and you really do have multiple genomes with the same TaxID, run the command with a different value for -v for each organism with the same TaxID. The version number is used by ITEP to distinguish between different organisms that are assigned the same TaxID, since NCBI no longer assigns unique TaxIDs to every organism.
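When several genomes share a TaxID, the version numbers can be assigned mechanically. The sketch below builds one command line per Genbank file, giving each genome with a repeated TaxID the next unused version number. Only the `-g` and `-v` options shown above are used; the function name `build_import_commands` and the file names in the usage note are hypothetical, and the TaxIDs are supplied by you rather than read from the files.

```python
from collections import defaultdict


def build_import_commands(genomes, first_version=1):
    """Build convertGenbank2table.py command lines with distinct versions.

    genomes: list of (genbank_file, taxid) pairs.
    Returns a list of argument lists, one per Genbank file.
    """
    next_version = defaultdict(lambda: first_version)
    commands = []
    for genbank_file, taxid in genomes:
        version = next_version[taxid]
        next_version[taxid] += 1  # the next genome with this TaxID gets a new version
        commands.append(
            ["./convertGenbank2table.py", "-g", str(genbank_file), "-v", str(version)]
        )
    return commands
```

Each returned list can then be run from $root, for example with `subprocess.run(cmd, cwd=root, check=True)`. With `[("a.gbk", "1502"), ("b.gbk", "1502")]` as input, the first file gets `-v 1` and the second `-v 2`.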
For the sake of illustration, we will build a database containing two Clostridia species, Clostridium beijerinckii NCIMB 8052 and Clostridium novyi NT, and the outgroup Acetobacterium woodii DSM 1030. The analysis can, of course, be done with more organisms, but with only three, building the ITEP database takes very little time (the time scales roughly as O(n^2), where n is the number of genomes, because it requires running BLAST between each pair of genes).
Navigate to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/ (all of these are complete genomes) and for each of the organisms do the following:
- Go into the folder for that organism
- Download all .gbk files for the organism into a scratch folder somewhere (Beware though: sometimes NCBI puts virus genomes, which have different taxIDs, in the same folder as the actual genome and plasmids which have the taxID you want. Make sure that all the TaxIDs are consistent between the Genbank files you download for the same organism).
- Concatenate the Genbank files for that organism (all three of our organisms have only one Genbank file so this is not necessary in this case) using the "cat" command.
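The TaxID consistency check mentioned in the download step can be automated before concatenating. The sketch below collects the `/db_xref="taxon:..."` values found in each downloaded file and lists the files that do not carry exactly the expected TaxID. It is a plain-text scan rather than a full Genbank parse, and the function names are hypothetical.

```python
import re
from pathlib import Path

TAXON_PATTERN = re.compile(r'/db_xref="taxon:(\d+)"')


def taxids_by_file(paths):
    """Map each Genbank file path to the set of TaxIDs found in its text."""
    return {str(p): set(TAXON_PATTERN.findall(Path(p).read_text())) for p in paths}


def find_inconsistent_files(paths, expected_taxid):
    """Return the files whose TaxIDs are not exactly {expected_taxid}.

    Files with no TaxID at all are also reported.
    """
    expected = {str(expected_taxid)}
    return sorted(
        name for name, found in taxids_by_file(paths).items() if found != expected
    )
```

An empty result from `find_inconsistent_files(files, 290402)` suggests the files are safe to concatenate; any file it lists (for example a virus genome stored in the same folder) should be left out.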
Place the concatenated Genbank files somewhere OTHER THAN $root/genbank (where $root is the root directory of the repository, containing the setup scripts). For each of the concatenated Genbank files, run the following from $root:
$ ./convertGenbank2table.py -g [genbank_file] -v 1
If done correctly for each of the Genbank files, the following files will be created:
$root/genbank/290402.1.gbk [Clostridium beijerinckii]
$root/genbank/386415.1.gbk [Clostridium novyi]
$root/genbank/931626.1.gbk [Acetobacterium woodii]
In addition, tab-delimited files in the format required by ITEP are created in the following directories:
$root/raw/290402.1.txt
$root/raw/386415.1.txt
$root/raw/931626.1.txt
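A quick way to confirm that the import produced the files listed above is to check for them directly. The sketch below derives the expected `genbank/` and `raw/` paths from each TaxID and version number and reports any that are missing; the function names are hypothetical and the script is not part of ITEP.

```python
from pathlib import Path


def expected_outputs(root, taxid, version):
    """Return the two files the import should create for one genome."""
    stem = f"{taxid}.{version}"
    root = Path(root)
    return [root / "genbank" / f"{stem}.gbk", root / "raw" / f"{stem}.txt"]


def missing_outputs(root, genomes):
    """List expected output files that do not exist.

    genomes: iterable of (taxid, version) pairs.
    """
    missing = []
    for taxid, version in genomes:
        for path in expected_outputs(root, taxid, version):
            if not path.is_file():
                missing.append(path)
    return missing
```

For the three example organisms, `missing_outputs(root, [(290402, 1), (386415, 1), (931626, 1)])` should return an empty list once all three imports have succeeded.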
After this, create the file that lists the organisms by running
./generateOrganismFileFromGbk.sh
If this worked, congratulations! You can now run the database-building commands and everything should work (if it doesn't work please file a bug and let me know what organisms you were using if possible). You can check that the formatting is as ITEP expects by running
./checkInputFormat.sh
It will complain about a couple of missing files that are created as part of the database-building scripts - you can ignore these errors for now since you haven't tried to build the database yet.
Genbank files used for input into ITEP must contain gene annotations (ITEP does not contain an annotation program). If the Genbank files are not annotated you will get empty gene information files.
To avoid parse errors, obtain Genbank files from a program or database that provides them in a standard format (such as NCBI or EBI), so that Biopython can parse them.
Note that NCBI changes the format of Genbank files enough to break parsers from time to time. If the import fails, try upgrading Biopython; a newer version may resolve the issue.