This module contains a collection of utility classes for processing and loading PDB repository and derived data content using relational and document database servers. One target data store for these tools is a document database used to exchange content within the RCSB PDB data pipeline.
Download the library source software from the project repository:
git clone --recurse-submodules https://github.com/rcsb/py-rcsb_db.git
Optionally, run test suite (Python versions 3.9+) using setuptools or tox:
python setup.py test
or simply run
tox
Installation is via the program pip. To run tests from the source tree, the package must be installed in editable mode (i.e. -e):
pip install -e .
You will need a few packages, before pip install .
can work:
sudo apt install default-libmysqlclient-dev flex bison
To use and develop this package on macOS requires a number of packages that are not distributed as part of the base macOS operating system. The following steps provide one approach to creating the development environment for this package. First, install the Apple XCode package and associate command-line tools. This will provide essential compilers and supporting tools. The HomeBrew package manager provides further access to a variety of common open source services and tools. Follow the instructions provided by at the HomeBrew site to install this system. Once HomeBrew is installed, you can further install the MariaDB and MongoDB packages which are required to support the ExDB tools. HomeBrew also provides a variety of options for managing a Python virtual environments.
RCSB/PDB repository path details are stored as configuration options.
An example configuration file included in this package is viewable under rcsb/db/config
: exdb-config-example.yml. This example references dictionary resources and mock repository data
provided in the package in rcsb/mock-data/*
. The site_info_configuration
section
in this file provides database server connection details and common path details.
This is followed by sections specifying the dictionaries, helper functions, and
configuration used to define the schema for the each supported content type
(e.g., pdbx_core, chem_comp_core, bird_chem_comp_core,.. ).
A convenience CLI schema_update_cli
is provided for generating operational schema from
PDBx/mmCIF dictionary metadata. Schema are encoded for the ExDB API (rcsb), and
for the document schema encoded in JSON and BSON formats. The latter schema can be used to
validate the loadable document objects produced for the collections served by MongoDB.
=> schema_update_cli --help
usage: schema_update_cli [-h] [--update_chem_comp_ref]
[--update_chem_comp_core_ref]
[--update_bird_chem_comp_ref]
[--update_bird_chem_comp_core_ref]
[--update_bird_ref] [--update_bird_family_ref]
[--update_pdbx] [--update_pdbx_core]
[--update_repository_holdings]
[--update_entity_sequence_clusters]
[--update_data_exchange] [--update_ihm_dev]
[--update_drugbank_core] [--update_config_all]
[--update_config_deployed] [--update_config_test]
[--config_path CONFIG_PATH]
[--config_name CONFIG_NAME]
[--cache_path SCHEMA_CACHE_PATH]
[--schema_types SCHEMA_TYPES]
[--schema_levels SCHEMA_LEVELS] [--debug] [--mock]
optional arguments:
-h, --help show this help message and exit
--update_chem_comp_ref
Update schema for Chemical Component reference
definitions
--update_chem_comp_core_ref
Update core schema for Chemical Component reference
definitions
--update_bird_chem_comp_ref
Update schema for Bird Chemical Component reference
definitions
--update_bird_chem_comp_core_ref
Update core schema for Bird Chemical Component
reference definitions
--update_bird_ref Update schema for Bird reference definitions
--update_bird_family_ref
Update schema for Bird Family reference definitions
--update_pdbx Update schema for PDBx entry data
--update_pdbx_core Update schema for PDBx core entry/entity data
--update_repository_holdings
Update schema for repository holdings
--update_entity_sequence_clusters
Update schema for entity sequence clusters
--update_data_exchange
Update schema for data exchange status
--update_ihm_dev Update schema for I/HM dev entry data
--update_drugbank_core
Update DrugBank schema
--update_config_all Update using configuration settings (e.g.
DATABASE_NAMES_ALL)
--update_config_deployed
Update using configuration settings (e.g.
DATABASE_NAMES_DEPLOYED)
--update_config_test Update using configuration settings (e.g.
DATABASE_NAMES_TEST)
--config_path CONFIG_PATH
Path to configuration options file
--config_name CONFIG_NAME
Configuration section name
--cache_path CACHE_PATH
Schema cache directory path
--schema_types SCHEMA_TYPES
Schema encoding (rcsb|json|bson) (comma separated)
--schema_levels SCHEMA_LEVELS
Schema validation level (full|min) (comma separated)
--debug Turn on verbose logging
--mock Use MOCK repository configuration for dependencies and
testing
________________________________________________________________________________
For example, the following command will generate the JSON and BSON schema for the collections in the pdbx_core schema.
schema_update_cli --mock --schema_types json,bson \
--schema_level full \
--update_pdbx_core \
--cache_path . \
--config_path ./rcsb/db/config/exdb-config-example.yml \
--config_name site_info_configuration
A convenience CLI exdb_repo_load_cli
is provided to support loading PDB repositories
containing entry and chemical reference data content types in the form of document collections
compatible with MongoDB.
exdb_repo_load_cli --help
usage: exdb_repo_load_cli [-h] [--op OP_TYPE] [--load_type LOAD_TYPE]
[--database DATABASE_NAME]
[--config_path CONFIG_PATH]
[--config_name CONFIG_NAME] [--db_type DB_TYPE]
[--num_proc NUM_PROC] [--chunk_size CHUNK_SIZE]
[--document_style DOCUMENT_STYLE]
[--disable_read_back_check] [--schema_level SCHEMA_LEVEL]
[--load_id_list_path LOAD_ID_LIST_PATH]
[--load_file_list_path LOAD_FILE_LIST_PATH]
[--fail_file_list_path FAIL_FILE_LIST_PATH]
[--save_file_list_path SAVE_FILE_LIST_PATH]
[--file_limit FILE_LIMIT]
[--prune_document_size PRUNE_DOCUMENT_SIZE]
[--debug] [--mock] [--cache_path CACHE_PATH]
[--rebuild_cache] [--rebuild_schema]
[--vrpt_repo_path VRPT_REPO_PATH]
optional arguments:
-h, --help show this help message and exit
--op {pdbx_loader,build_resource_cache,pdbx_db_wiper,pdbx_id_list_splitter,pdbx_loader_check,etl_entity_sequence_clusters,etl_repository_holdings}
Loading operation to perform
--load_type {replace,full}
Type of load ('replace' for incremental and
multi-worker load, 'full' for complete and
fresh single-worker load)
--database {pdbx_core,pdbx_comp_model_core,bird_chem_comp_core,chem_comp,chem_comp_core,bird_chem_comp,bird,bird_family,ihm_dev}
Database to load (most common choices are:
'pdbx_core', 'pdbx_comp_model_core', or
'bird_chem_comp_core')
--config_path CONFIG_PATH
Path to configuration options file
--config_name CONFIG_NAME
Configuration section name
--document_style DOCUMENT_STYLE
Document organization (rowwise_by_name_with_c
ardinality|rowwise_by_name|columnwise_by_name
|rowwise_by_id|rowwise_no_name)
--cache_path CACHE_PATH
Cache path for resource files
--num_proc NUM_PROC Number of processes to execute (default=2)
--chunk_size CHUNK_SIZE
Number of files loaded per process
--max_step_length MAX_STEP_LENGTH
Maximum subList size (default=500)
--schema_level SCHEMA_LEVEL
Schema validation level (full|min)
--collection_list COLLECTION_LIST
Specific collections to load
--load_id_list_path LOAD_ID_LIST_PATH
Input file containing the list of IDs to load
in the current iteration by a single worker
--holdings_file_path HOLDINGS_FILE_PATH
File containing the complete list of all IDs
(or holdings files) that will be loaded
--load_file_list_path LOAD_FILE_LIST_PATH
Input file containing load file path list
(override automatic repository scan)
--fail_file_list_path FAIL_FILE_LIST_PATH
Output file containing file paths that fail
to load
--save_file_list_path SAVE_FILE_LIST_PATH
Save repo file paths from automatic file
system scan in this path
--load_file_list_dir LOAD_FILE_LIST_DIR
Directory path for storing load file lists
--num_sublists NUM_SUBLISTS
Number of sublists to create/load for the
associated database
--force_reload Force re-load of provided ID list (i.e.,
don't just load delta; useful for manual/test
runs).
--provider_types_exclude
Resource provider types to exclude
--db_type DB_TYPE Database server type (default=mongo)
--file_limit FILE_LIMIT
Load file limit for testing
--prune_document_size PRUNE_DOCUMENT_SIZE
Prune large documents to this size limit (MB)
--regex_purge Perform additional regex-based purge of all
pre-existing documents for loadType != 'full'
--data_selectors [ ...]
Data selectors, space-separated.
--disable_read_back_check
Disable read back check on all documents
--disable_merge_validation_reports
Disable merging of validation report data
with the primary content type
--debug Turn on verbose logging
--mock Use MOCK repository configuration for testing
--rebuild_cache Rebuild cached resource files
--rebuild_schema Rebuild schema on-the-fly if not cached
--vrpt_repo_path VRPT_REPO_PATH
Path to validation report repository
________________________________________________________________________________
The following commands demonstrate how each type of operation (--op
) is used for loading of PDB repository data to ExDB. For all commands, the following environmental variables must first be set:
export CONFIG_SUPPORT_TOKEN_ENV=personal_token_used_for_decrypting_config_variables
export OE_LICENSE=/path/to/oe_license.txt
export NLTK_DATA=/path/to/nltk_data
--op build_resource_cache
- Build the external resource cache that will be used for and integrated with the loading of PDB structure data.
exdb_repo_load_cli --op "build_resource_cache" \
--config_path "/opt/etl-scratch/config/exdb-loader-config.yml" \
--config_name "site_info_remote_configuration" \
--num_proc 6 \
--cache_path "/opt/etl-scratch/data/CACHE" \
--op pdbx_db_wiper
- Wipe the pre-existing database (and all of its collections).
exdb_repo_load_cli --op "pdbx_db_wiper" \
--database "pdbx_core" \
--config_path "/opt/etl-scratch/config/exdb-loader-config.yml" \
--config_name "site_info_remote_configuration" \
--cache_path "/opt/etl-scratch/data/CACHE" \
--op pdbx_id_list_splitter
- Split the full list of input IDs into smaller, equally-sized sublists.
exdb_repo_load_cli --op "pdbx_id_list_splitter" \
--database "pdbx_core" \
--config_path "/opt/etl-scratch/config/exdb-loader-config.yml" \
--config_name "site_info_remote_configuration" \
--cache_path "/opt/etl-scratch/data/CACHE" \
--load_file_list_dir "/opt/etl-scratch/work-dir/load_file_lists" \
--holdings_file_path "https://files.wwpdb.org/pub/pdb/holdings/released_structures_last_modified_dates.json.gz" \
--num_sublists 10 \
--op pdbx_loader
- Load a list of entry IDs to ExDB.
exdb_repo_load_cli --op "pdbx_loader" \
--database "pdbx_core" \
--load_type replace \
--config_path /opt/etl-scratch/config/exdb-loader-config.yml \
--config_name site_info_remote_configuration \
--num_proc 8 \
--chunk_size 5 \
--max_step_length 500 \
--load_id_list_path "/opt/etl-scratch/work-dir/load_file_lists/pdbx_core_ids-1.txt" \
--cache_path "/opt/etl-scratch/data/CACHE" \
--op pdbx_loader_check
- Check the resulting ExDB database to confirm that all expected documents were loaded.
exdb_repo_load_cli --op "pdbx_loader_check" \
--database "pdbx_core" \
--config_path "/opt/etl-scratch/config/exdb-loader-config.yml" \
--config_name "site_info_remote_configuration" \
--cache_path "/opt/etl-scratch/data/CACHE" \
--load_file_list_dir "/opt/etl-scratch/work-dir/load_file_lists" \
--holdings_file_path "https://files.wwpdb.org/pub/pdb/holdings/released_structures_last_modified_dates.json.gz" \
--num_sublists 10 \
Part of the schema definition process supported by this module involves refining
the dictionary metadata with more specific data typing and coverage details.
A scanning tools is provided to collect and organize these details for the
other ETL tools in this package. The following convenience CLI, repo_scan_cli
,
is provided to scan supported PDB repository content and update data type and coverage details.
repo_scan_cli --help
usage: repo_scan_cli [-h] [--scanType SCANTYPE]
[--scan_chem_comp_ref | --scan_chem_comp_core_ref | --scan_bird_chem_comp_ref | --scan_bird_chem_comp_core_ref | --scan_bird_ref | --scan_bird_family_ref | --scan_entry_data | --scan_ihm_dev]
[--config_path CONFIG_PATH] [--config_name CONFIG_NAME]
[--input_file_list_path INPUT_FILE_LIST_PATH]
[--output_file_list_path OUTPUT_FILE_LIST_PATH]
[--fail_file_list_path FAIL_FILE_LIST_PATH]
[--scan_data_file_path SCAN_DATA_FILE_PATH]
[--coverage_file_path COVERAGE_FILE_PATH]
[--type_map_file_path TYPE_MAP_FILE_PATH]
[--num_proc NUM_PROC] [--chunk_size CHUNK_SIZE]
[--file_limit FILE_LIMIT] [--debug] [--mock]
[--working_path WORKING_PATH]
optional arguments:
-h, --help show this help message and exit
--scanType SCANTYPE Repository scan type (full|incr)
--scan_chem_comp_ref Scan Chemical Component reference definitions (public
subset)
--scan_chem_comp_core_ref
Scan Chemical Component Core reference definitions
(public subset)
--scan_bird_chem_comp_ref
Scan Bird Chemical Component reference definitions
(public subset)
--scan_bird_chem_comp_core_ref
Scan Bird Chemical Component Core reference
definitions (public subset)
--scan_bird_ref Scan Bird reference definitions (public subset)
--scan_bird_family_ref
Scan Bird Family reference definitions (public subset)
--scan_entry_data Scan PDB entry data (current released subset)
--scan_ihm_dev Scan PDBDEV I/HM entry data (current released subset)
--config_path CONFIG_PATH
Path to configuration options file
--config_name CONFIG_NAME
Configuration section name
--input_file_list_path INPUT_FILE_LIST_PATH
Input file containing file paths to scan
--output_file_list_path OUTPUT_FILE_LIST_PATH
Output file containing file paths scanned
--fail_file_list_path FAIL_FILE_LIST_PATH
Output file containing file paths that fail scan
--scan_data_file_path SCAN_DATA_FILE_PATH
Output working file storing scan data (Pickle)
--coverage_file_path COVERAGE_FILE_PATH
Coverage map (JSON) output path
--type_map_file_path TYPE_MAP_FILE_PATH
Type map (JSON) output path
--num_proc NUM_PROC Number of processes to execute (default=2)
--chunk_size CHUNK_SIZE
Number of files loaded per process
--file_limit FILE_LIMIT
Load file limit for testing
--debug Turn on verbose logging
--mock Use MOCK repository configuration for testing
--working_path WORKING_PATH
Working path for temporary files
________________________________________________________________________________
The following CLI provides a preliminary access to ETL functions for processing derived content types such as sequence comparative data.
etl_exec_cli --help
usage: etl_exec_cli [-h] [--full] [--etl_entity_sequence_clusters]
[--etl_repository_holdings] [--data_set_id DATA_SET_ID]
[--sequence_cluster_data_path SEQUENCE_CLUSTER_DATA_PATH]
[--sandbox_data_path SANDBOX_DATA_PATH]
[--config_path CONFIG_PATH] [--config_name CONFIG_NAME]
[--db_type DB_TYPE] [--read_back_check]
[--num_proc NUM_PROC] [--chunk_size CHUNK_SIZE]
[--document_limit DOCUMENT_LIMIT]
[--prune_document_size PRUNE_DOCUMENT_SIZE] [--debug]
[--mock] [--cache_path CACHE_PATH] [--rebuild_cache]
optional arguments:
-h, --help show this help message and exit
--full Fresh full load in a new tables/collections (Default)
--etl_entity_sequence_clusters
ETL entity sequence clusters
--etl_repository_holdings
ETL repository holdings
--data_set_id DATA_SET_ID
Data set identifier (default= 2018_14 for current
week)
--sequence_cluster_data_path SEQUENCE_CLUSTER_DATA_PATH
Sequence cluster data path (default set by
configuration
--sandbox_data_path SANDBOX_DATA_PATH
Date exchange sandboxPath data path (default set by
configuration
--config_path CONFIG_PATH
Path to configuration options file
--config_name CONFIG_NAME
Configuration section name
--db_type DB_TYPE Database server type (default=mongo)
--read_back_check Perform read back check on all documents
--num_proc NUM_PROC Number of processes to execute (default=2)
--chunk_size CHUNK_SIZE
Number of files loaded per process
--document_limit DOCUMENT_LIMIT
Load document limit for testing
--prune_document_size PRUNE_DOCUMENT_SIZE
Prune large documents to this size limit (MB)
--debug Turn on verbose logging
--mock Use MOCK repository configuration for testing
--cache_path CACHE_PATH
Path containing cache directories
--rebuild_cache Rebuild cached resource files
________________________________________________________________________________
(Note: The examples below are outdated and may not function as described. They are only kept here for historical reference.)
If you are working in the source repository, then you can run the CLI commands in the following manner. The following examples load data in the mock repositories in source distribution assuming you have a local default installation of MongoDb (no user/pw assigned).
To run the command-line interface exdb_repo_load_cli
outside of the source distribution, you will need to
create a configuration file with the appropriate path details and authentication credentials.
For instance, to perform a fresh/full load of all of the chemical component definitions in the mock repository:
cd rcsb/db/cli
python RepoLoadExec.py --full --load_chem_comp_ref \
--config_path ../config/exdb-config-example.yml \
--config_name site_info_configuration \
--fail_file_list_path failed-cc-path-list.txt \
--read_back_check
The following illustrates, a full load of the mock structure data repository followed by a reload with replacement of this same data.
cd rcsb/db/cli
python RepoLoadExec.py --mock --full --load_entry_data \
--config_path ../config/exdb-config-example.yml \
--config_name site_info_configuration \
--save_file_list_path LATEST_PDBX_LOAD_LIST.txt \
--fail_file_list_path failed-entry-path-list.txt
python RepoLoadExec.py --mock --replace --load_entry_data \
--config_path ../config/exdb-config-example.yml \
--config_name site_info_configuration \
--load_file_list_path LATEST_PDBX_LOAD_LIST.txt \
--fail_file_list_path failed-entry-path-list.txt