Code for making bioinformatics work with the SEED subsystems database a little easier.
Location for obtaining the raw files - the FTP Seed site, Subsystems folder: ftp://ftp.theseed.org/subsystems/
For a full write-up of the initial reasons behind this code, check out the blog post here: http://www.metatranscriptomics.com/2017/01/working-with-subsystems-functional.html
- Subsystems_simplifier.py - takes the "subsystems.complex" big file and makes it into a bioinformatics-friendly layout (FASTA, suitable to be used by annnotation algorithms like BLAST and DIAMOND as a database)
- FIG_extractor.py - takes the "subsystems.complex" big file and extracts Fig IDs and their associated functions (level 4 in the Subsystems hierarchy, the lowest level)
- fig_swapper.py - takes four files - the output of FIG_extractor, a results file with Fig IDs, the "subsystems2role" file, and "subsys.txt" file - and adds both function names and hierarchy (when it can be found) to each Fig ID in the results file.
- subsys_db_rebuilder - performs all of the above steps to create one large index file with all combined information. Best for creating a single custom database that can be searched later.
Program used: subsys_db_rebuilder.py
Files needed:
- subsystems.complex
- subsys2role
- subsys2peg
- subsys.txt
Command:
`python2.7 subsys_db_rebuilder.py subsystems.complex subsys2role subsys2peg subsys.txt`
`sed 's/\t/ /g' subsystems.complex.merged > notabs.subsystems.complex.merged`
`python2.7 duplicate_counter.py notabs.subsystems.complex.merged`
Note that the four input files need to be in this specific order for the program to work.
These commands do the following:
- Merge the individual files into one complete overall file
- Replace tabs with spaces (needed if converting into a BLAST database)
- Scrub out duplicate sequences - this significantly reduces the total database footprint, from 7 Gb down to 4.4 Gb.
Output: Two files:
- notabs.subsystems.complex.merged.reduced - this large file, in FASTA format, contains all UNIQUE protein sequences from the Subsystems database. The header contains - tab separated:
- The Fig ID
- The specific function (level 4 hierarchy)
- Level 3 hierarchy
- Level 2 hierarchy
- Level 1 hierarchy (top level of Subsystems)
- Other IDs (GI accessions, GO terms, etc.)
- subsystems.complex.no_hierarchy - not all Subsystems sequences have hierarchy information. These "NO HIERARCHY" sequences are carried over into subsystems.complex.merged, but are also printed in this file, in the same format as above.
Perhaps you don't care about rebuilding the entire database, and just want to know the hierarchy for your list of Fig IDs. In that case, the following program should help.
Program used: fig_swapper.py
Files needed: subsystems2role
subsystems2peg
subsys.txt
tab-separated results file, with Fig IDs in column 3
Command:
python2.7 fig_swapper.pysubsystems2peg results_file subsystems2role subsys.txt
Output: The input file with ".converted" as a suffix. This file should contain the original 3 columns, with additional hierarchy information added after the Fig ID as extra columns. These columns go in reverse hierarchy; level 4, level 3, level 2, and level 1 as the right-most column.