The InterPro Protein Update is the set of procedures that load protein data from UniProtKB/Swiss-Prot and UniProtKB/TrEMBL flat files, and update InterPro production tables to reflect the changes.
- Python 3.3+.
- The `numpy` and `h5py` Python packages.
- The `mundone` and `pyswiss` Python packages (included in this repository).
git clone https://github.com/ProteinsWebTeam/interpro-protein-update.git
cd interpro-protein-update
bash setup.sh
Make a copy of `config.ini.sample`, and edit it.
| Section | Option | Description | Comment |
|---|---|---|---|
| Database | host | database TNS | |
| | user_pro | interpro user connection string (`interpro/********`) | |
| | user_scan | iprscan user connection string (`iprscan/********`) | |
| | user_parc | uniparc user connection string (`uniparc/********`) | Used only for tests, not in production, so it can be left empty. |
| UniProt | version | release version (e.g. 2017_07) | |
| | date | release date (e.g. 05-Jul-2017) | |
| | swissprot_file | UniProtKB/Swiss-Prot flat file path | |
| | trembl_file | UniProtKB/TrEMBL flat file path | |
| Directories | out | output directory | HDF5 and some log files |
| | tmp | temporary directory | |
| | tab | table files directory | For xref_summary table files |
| Cluster | queue | LSF queue name | |
| | server | mail server host | |
| | sender | sender address | Your EBI email address, or the team email address |
| | interpro | InterPro team email address | |
| | aa | Automatic Annotation (AA) team email address | |
| | uniprot | UniProt team email address | |
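
As a quick sanity check before launching the pipeline, the edited configuration file can be validated with Python's standard `configparser` module. The sketch below is illustrative only: it assumes the file uses the section and option names listed above, the sample values are placeholders, and the helper `missing_options` is hypothetical (it is not part of this repository). The mail-related options are left out for brevity.

```python
import configparser

# Hypothetical sample mirroring the sections and options listed above.
# All values are placeholders, not real credentials or paths.
SAMPLE = """
[Database]
host = MY_TNS
user_pro = interpro/secret
user_scan = iprscan/secret
user_parc =

[UniProt]
version = 2017_07
date = 05-Jul-2017
swissprot_file = /path/to/uniprot_sprot.dat
trembl_file = /path/to/uniprot_trembl.dat

[Directories]
out = /path/to/out
tmp = /path/to/tmp
tab = /path/to/tab

[Cluster]
queue = my-queue
"""

# Options expected to be non-empty (user_parc may be left empty).
REQUIRED = {
    "Database": ["host", "user_pro", "user_scan"],
    "UniProt": ["version", "date", "swissprot_file", "trembl_file"],
    "Directories": ["out", "tmp", "tab"],
    "Cluster": ["queue"],
}


def missing_options(config):
    """Return 'Section.option' names that are absent or empty."""
    missing = []
    for section, options in REQUIRED.items():
        for option in options:
            value = config.get(section, option, fallback="")
            if not value.strip():
                missing.append("{}.{}".format(section, option))
    return missing


# interpolation=None so that a '%' in a password is read literally.
config = configparser.ConfigParser(interpolation=None)
config.read_string(SAMPLE)
print(missing_options(config))
```

To check a real file, replace `config.read_string(SAMPLE)` with `config.read("config.ini")`.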
| Step | Task | Description | Comment |
|---|---|---|---|
| Update 1A | load_swissprot | Stores UniProtKB/Swiss-Prot proteins in an HDF5 file | |
| | load_trembl | Stores UniProtKB/TrEMBL proteins in an HDF5 file | |
| | dump_db | Stores proteins in the InterPro database in an HDF5 file | |
| | merge_h5 | Concatenates Swiss-Prot and TrEMBL proteins | |
| | insert_proteins | Inserts protein changes and new proteins | |
| | method_changes | Finds changes to assignments of signatures to InterPro entries | |
| Update 1B | update_proteins | Updates production tables with protein data | |
| UniParc.xref | uniparc_xref | Updates cross-references from UniParc | |
| Pre-check IPRSCAN | iprscan_precheck | Checks if MV_IPRSCAN is ready (i.e. UniParc matches update completed) | Skipped, unless explicitly called |
| Refresh IPRSCAN | iprscan_refresh | Refreshes MV_IPRSCAN with the latest data from ISPRO | |
| Check IPRSCAN | iprscan_check | Generates the IPRSCAN health check | |
| Refresh METHOD2SWISS_DE | method2swiss | Populates the METHOD2SWISS_DE table with Swiss-Prot descriptions | Required by Happy Helper |
| Update 2 | prepare_matches | Finds new matches | A pre-production report is generated, and must be checked |
| | prepare_feature_matches | Finds new feature matches | A pre-production report is generated, and must be checked |
| Refresh AA_IPRSCAN | aa_iprscan | Recreates a materialized view with up-to-date data from MV_IPRSCAN | |
| Update 3 | update_matches | Updates production tables with match data | |
| | update_feature_matches | Updates production tables with feature match data | |
| | finalize | Refreshes match MV tables | |
| | refresh_go | Refreshes InterPro2GO MV tables | Low priority |
| | refresh_feature_matches | Refreshes feature match MV tables | Low priority |
| Check CRC64 | crc64 | Deletes mismatched CRC64 in the protein table | |
| Report method changes | report_method_changes | Final report that includes deleted, moved, and new signatures | |
| Update SITE_MATCH | site_match | Inserts new matches into the SITE_MATCH table | |
| XREF summary | dump_xref | Updates the XREF_SUMMARY table and dumps tab files | |
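
The task list above implies an order, plus one task (`iprscan_precheck`) that is skipped unless explicitly called. A minimal sketch of how a requested subset could be resolved against that order is shown below. It is not taken from `ipucli.py`: the function `select_tasks` is hypothetical, and both "run everything when no task is named" and "always run in table order" are assumptions made for illustration.

```python
# Task names in the order they appear in the table above.
PIPELINE = [
    "load_swissprot", "load_trembl", "dump_db", "merge_h5",
    "insert_proteins", "method_changes", "update_proteins",
    "uniparc_xref", "iprscan_precheck", "iprscan_refresh",
    "iprscan_check", "method2swiss", "prepare_matches",
    "prepare_feature_matches", "aa_iprscan", "update_matches",
    "update_feature_matches", "finalize", "refresh_go",
    "refresh_feature_matches", "crc64", "report_method_changes",
    "site_match", "dump_xref",
]

# Tasks that run only when named explicitly (see the Comment column).
EXPLICIT_ONLY = {"iprscan_precheck"}


def select_tasks(requested=None):
    """Return the tasks to run, in table order.

    Assumption: with no requested task, every task runs except the
    explicit-only ones. Unknown task names raise ValueError.
    """
    if not requested:
        return [t for t in PIPELINE if t not in EXPLICIT_ONLY]
    unknown = sorted(set(requested) - set(PIPELINE))
    if unknown:
        raise ValueError("unknown task(s): " + ", ".join(unknown))
    wanted = set(requested)
    return [t for t in PIPELINE if t in wanted]


print(select_tasks(["merge_h5", "load_swissprot"]))
```

Resolving against a fixed list keeps the run order independent of the order in which task names are typed on the command line.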
python ipucli.py -c CONFIG -t [TASK [TASK ...]]
Where `CONFIG` is the path to the configuration file, and each `TASK` is a task name.
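
For readers unfamiliar with this style of command line, the usage string above corresponds to one option taking a single value and one taking zero or more values. The `argparse` sketch below reproduces that shape; it is an illustration only, not the actual parser in `ipucli.py`, and making `-c` mandatory is an assumption.

```python
import argparse


def build_parser():
    """Build a parser shaped like: ipucli.py -c CONFIG -t [TASK [TASK ...]]"""
    parser = argparse.ArgumentParser(prog="ipucli.py")
    # Assumption: the configuration file is mandatory.
    parser.add_argument("-c", dest="config", metavar="CONFIG", required=True,
                        help="path to the configuration file")
    # nargs="*" accepts zero or more task names after -t.
    parser.add_argument("-t", dest="tasks", metavar="TASK", nargs="*",
                        default=[], help="names of the tasks to run")
    return parser


# Equivalent to: python ipucli.py -c config.ini -t load_swissprot load_trembl
args = build_parser().parse_args(
    ["-c", "config.ini", "-t", "load_swissprot", "load_trembl"]
)
print(args.config, args.tasks)
```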
`UNIPARC.PROTEIN` is a materialised view. It is not refreshed by this pipeline, but by a DBMS scheduler job (in Oracle SQL Developer: Scheduler > DBMS Jobs, under the DBA Jobs tab).