This repository has been archived by the owner on Aug 23, 2022. It is now read-only.

ProteinsWebTeam/interpro-protein-update

InterPro Protein Update

⚠️ As of August 2019, this repository is no longer used for the Protein Update. Please use pyinterprod instead.

The InterPro Protein Update is the set of procedures that load protein data from UniProtKB/Swiss-Prot and UniProtKB/TrEMBL flat files and update the InterPro production tables to reflect the changes.

Getting started

Requirements

  • Python 3.3+.
  • The numpy and h5py Python packages.
  • The mundone and pyswiss Python packages (included in this repository).

Installation

git clone https://github.com/ProteinsWebTeam/interpro-protein-update.git
cd interpro-protein-update
bash setup.sh

Configuration

Make a copy of config.ini.sample and edit it. The options are described below.

| Section | Option | Description | Comment |
|---|---|---|---|
| Database | host | database TNS | |
| | user_pro | interpro user connection string (interpro/********) | |
| | user_scan | iprscan user connection string (iprscan/********) | |
| | user_parc | uniparc user connection string (uniparc/********) | Used only for tests, not in production, so it can be left empty. |
| UniProt | version | release version (e.g. 2017_07) | |
| | date | release date (e.g. 05-Jul-2017) | |
| | swissprot_file | UniProtKB/Swiss-Prot flat file path | |
| | trembl_file | UniProtKB/TrEMBL flat file path | |
| Directories | out | output directory | HDF5 and some log files |
| | tmp | temporary directory | |
| | tab | table files directory | For xref_summary table files |
| Cluster | queue | LSF queue name | |
| Mail | server | mail server host | |
| | sender | sender address | Your EBI email address, or the team email address |
| | interpro | InterPro team email address | |
| | aa | Automated Automation team email address | |
| | uniprot | UniProt team email address | |
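
Before launching a task, it can help to check that an edited configuration file is complete. The following is a minimal sketch using Python's standard configparser module; it is not part of the pipeline. The section and option names are taken from the table above, but their exact spelling and case in config.ini.sample are an assumption, and every value in SAMPLE is a placeholder. The helper names check_config and release_date are hypothetical.

```python
import configparser
from datetime import datetime

# Placeholder configuration: names follow the table above, values are invented.
SAMPLE = """
[Database]
host = MYTNS
user_pro = interpro/secret
user_scan = iprscan/secret
user_parc =

[UniProt]
version = 2017_07
date = 05-Jul-2017
swissprot_file = /path/to/uniprot_sprot.dat
trembl_file = /path/to/uniprot_trembl.dat

[Directories]
out = /path/to/out
tmp = /path/to/tmp
tab = /path/to/tab

[Cluster]
queue = myqueue

[Mail]
server = smtp.example.org
sender = me@example.org
interpro = interpro@example.org
aa = aa@example.org
uniprot = uniprot@example.org
"""

# Options that must be non-empty. user_parc is left out on purpose:
# the table says it is used only for tests and can be left empty.
REQUIRED = {
    "Database": ["host", "user_pro", "user_scan"],
    "UniProt": ["version", "date", "swissprot_file", "trembl_file"],
    "Directories": ["out", "tmp", "tab"],
    "Cluster": ["queue"],
    "Mail": ["server", "sender", "interpro", "aa", "uniprot"],
}


def check_config(text):
    """Parse INI text and return (parser, list of missing 'Section.option')."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    missing = []
    for section, options in REQUIRED.items():
        for option in options:
            # fallback="" also covers a section that is absent altogether
            value = parser.get(section, option, fallback="")
            if not value.strip():
                missing.append("{}.{}".format(section, option))
    return parser, missing


def release_date(parser):
    """Parse the release date, assuming the 05-Jul-2017 style shown above."""
    return datetime.strptime(parser.get("UniProt", "date"), "%d-%b-%Y").date()


if __name__ == "__main__":
    parser, missing = check_config(SAMPLE)
    print("missing options:", missing)
    print("release date:", release_date(parser))
```

To check a real file, read its contents and pass them to check_config. Note that `%b` parses month abbreviations according to the current locale, so the date check assumes an English (or C) locale.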

Workflow overview

| Step | Task | Description | Comment |
|---|---|---|---|
| Update 1A | load_swissprot | Stores UniProtKB/Swiss-Prot proteins in an HDF5 file | |
| | load_trembl | Stores UniProtKB/TrEMBL proteins in an HDF5 file | |
| | dump_db | Stores proteins in the InterPro database in an HDF5 file | |
| | merge_h5 | Concatenates Swiss-Prot and TrEMBL proteins | |
| | insert_proteins | Inserts protein changes and new proteins | |
| | method_changes | Finds changes to assignments of signatures to InterPro entries | |
| Update 1B | update_proteins | Updates production tables with protein data | |
| UniParc.xref | uniparc_xref | Updates cross-references from UniParc | |
| Pre-check IPRSCAN | iprscan_precheck | Checks if MV_IPRSCAN is ready (i.e. UniParc matches update completed) | Skipped, unless explicitly called |
| Refresh IPRSCAN | iprscan_refresh | Refreshes MV_IPRSCAN with the latest data from ISPRO | |
| Check IPRSCAN | iprscan_check | Generates the IPRSCAN health check | |
| Refresh METHOD2SWISS_DE | method2swiss | Populates the METHOD2SWISS_DE table with Swiss-Prot descriptions | Required by Happy Helper |
| Update 2 | prepare_matches | Finds new matches | A pre-production report is generated, and must be checked |
| | prepare_feature_matches | Finds new feature matches | A pre-production report is generated, and must be checked |
| Refresh AA_IPRSCAN | aa_iprscan | Recreates a materialized view with up-to-date data from MV_IPRSCAN | |
| Update 3 | update_matches | Updates production tables with match data | |
| | update_feature_matches | Updates production tables with feature match data | |
| | finalize | Refreshes match MV tables | |
| | refresh_go | Refreshes InterPro2GO MV tables | Low priority |
| | refresh_feature_matches | Refreshes feature match MV tables | Low priority |
| Check CRC64 | crc64 | Deletes mismatched CRC64 in the protein table | |
| Report method changes | report_method_changes | Final report that includes deleted, moved, and new signatures | |
| Update SITE_MATCH | site_match | Inserts new matches into the SITE_MATCH table | |
| XREF summary | dump_xref | Updates the XREF_SUMMARY table and dumps tab files | |
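
For scripting around the pipeline, the step-to-task grouping in the table above can be transcribed into a small lookup structure. This is only a sketch of the table, not the pipeline's internal representation; the names WORKFLOW, tasks_for, step_of, all_tasks, and unknown_tasks are hypothetical. Whether the pipeline itself enforces this order is not stated here.

```python
from collections import OrderedDict

# Steps and task names transcribed from the workflow table, in table order.
WORKFLOW = OrderedDict([
    ("Update 1A", ["load_swissprot", "load_trembl", "dump_db",
                   "merge_h5", "insert_proteins", "method_changes"]),
    ("Update 1B", ["update_proteins"]),
    ("UniParc.xref", ["uniparc_xref"]),
    ("Pre-check IPRSCAN", ["iprscan_precheck"]),
    ("Refresh IPRSCAN", ["iprscan_refresh"]),
    ("Check IPRSCAN", ["iprscan_check"]),
    ("Refresh METHOD2SWISS_DE", ["method2swiss"]),
    ("Update 2", ["prepare_matches", "prepare_feature_matches"]),
    ("Refresh AA_IPRSCAN", ["aa_iprscan"]),
    ("Update 3", ["update_matches", "update_feature_matches", "finalize",
                  "refresh_go", "refresh_feature_matches"]),
    ("Check CRC64", ["crc64"]),
    ("Report method changes", ["report_method_changes"]),
    ("Update SITE_MATCH", ["site_match"]),
    ("XREF summary", ["dump_xref"]),
])


def tasks_for(step):
    """Return the task names listed under a step (KeyError if unknown)."""
    return list(WORKFLOW[step])


def all_tasks():
    """Return every task name, flattened in table order."""
    return [task for tasks in WORKFLOW.values() for task in tasks]


def step_of(task):
    """Return the step a task belongs to, or None if the task is not listed."""
    for step, tasks in WORKFLOW.items():
        if task in tasks:
            return step
    return None


def unknown_tasks(names):
    """Return the names that do not appear in the workflow table."""
    known = set(all_tasks())
    return [name for name in names if name not in known]


if __name__ == "__main__":
    print(tasks_for("Update 2"))
    print(step_of("merge_h5"))
    print(unknown_tasks(["load_trembl", "not_a_task"]))
```

A check such as unknown_tasks can catch a mistyped task name before anything is submitted to the cluster.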

Running a step

python ipucli.py -c CONFIG -t [TASK [TASK ...]]

Where CONFIG is the path to the configuration file, and each TASK is a task name from the Task column of the workflow overview.
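
When the command is launched from another script, building the argument list in one place avoids quoting mistakes. The sketch below only assembles the command shown above; build_command is a hypothetical helper, not part of ipucli.py. Rejecting an empty task list is this sketch's own choice, since the behaviour with no task names is not documented here.

```python
import sys


def build_command(config, tasks, python=sys.executable, script="ipucli.py"):
    """Return the argument list for: python ipucli.py -c CONFIG -t TASK [TASK ...]"""
    tasks = list(tasks)
    if not tasks:
        # Sketch-level decision: require at least one task name.
        raise ValueError("at least one task name is required")
    return [python, script, "-c", config, "-t"] + tasks


if __name__ == "__main__":
    cmd = build_command("config.ini", ["load_swissprot", "load_trembl"])
    # Only printed here; to actually run it from the repository directory,
    # pass the list to subprocess.check_call(cmd).
    print(" ".join(cmd))
```

Passing the command as a list rather than a single string means no shell is involved, so paths containing spaces need no extra quoting.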

Notes

  • UNIPARC.PROTEIN is a materialised view. It is not refreshed by this pipeline but by a DBMS scheduler job (in Oracle SQL Developer: Scheduler > DBMS Jobs, under the DBA Jobs tab).
