Repository for resource files for CAPICE and for updating the CAPICE model. It contains several modules, listed below.
- Download/git clone the source code.
- Run `pip --no-cache-dir install -e '.[test]'`
Please note that the installation has to be completed for each module to function properly.
To test the individual modules, change directory to the specific module and run `pytest`.
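A minimal sketch of these two steps (the module path below is only illustrative):

```bash
# From the repository root: install with test dependencies
pip --no-cache-dir install -e '.[test]'

# Change directory to the module of interest (illustrative path) and run its tests
cd src/molgenis/capice_resources/balance_dataset
pytest
```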
`balance_dataset` is a module dedicated to balancing out a CAPICE train-test and/or validation dataset on a per-consequence, per-allele-frequency-bin level.
It outputs 2 TSV files: one balanced dataset file and one remainder dataset file.
For usage details, use `balance-dataset -h` or `python3 ./src/molgenis/capice_resources/balance_dataset -h`.
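A hypothetical invocation is sketched below; the flag names are assumptions for illustration only, so consult `balance-dataset -h` for the real options:

```bash
# Hypothetical example; flag names are assumptions, verify with balance-dataset -h
balance-dataset \
    -i /path/to/train_test.tsv.gz \
    -o /path/to/output_directory/
```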
`compare_model_features` is a module dedicated to comparing 2 CAPICE explain outputs.
Will output a file containing the normalized "gain" score and the feature ranking according to this normalized gain.
For usage details, use: `compare-model-features -h` or `python3 ./src/molgenis/capice_resources/compare_model_features -h`.
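As a hypothetical sketch (the flag names are assumptions, check `compare-model-features -h` for the real options):

```bash
# Hypothetical example; flag names are assumptions, verify with compare-model-features -h
compare-model-features \
    -a /path/to/capice_explain_model1.tsv.gz \
    -b /path/to/capice_explain_model2.tsv.gz \
    -o /path/to/output_directory/
```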
The module `compare_model_performance` is dedicated to obtaining the difference in performance between 2 CAPICE models.
It takes the "validation" dataset (or benchmark dataset, containing "binarized labels") and the `capice predict` output files of 2 different models and measures the performance differences between them.
It outputs plots that show the user the differences in performance between "model 1" and "model 2".
Note that the filenames of the score and "label" datasets will be displayed within the plots. It is up to the user to define easily distinguishable filenames.
For usage details, use: `compare-model-performance -h` or `python3 ./src/molgenis/capice_resources/compare_model_performance -h`.
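A hypothetical invocation could look like the following sketch; the flag names are assumptions, so check `compare-model-performance -h` for the real options:

```bash
# Hypothetical example; flag names are assumptions, verify with compare-model-performance -h
compare-model-performance \
    -a /path/to/capice_predict_model1.tsv.gz \
    -b /path/to/capice_predict_model2.tsv.gz \
    -l /path/to/validation_with_binarized_labels.tsv.gz \
    -o /path/to/output_directory/
```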
The module `process_vep` processes the train-test and validation VEP output TSVs (VCF to TSV using BCFTools, see the BCFTools usage below) back into usable files for the CAPICE training process.
The reason that this module requires both train-test and validation is so that duplicate entries between the supposedly independent datasets are filtered out.
Please note that the input supplied to VEP is created using the module `train_data_creator`, as that module adds additional information required by `process_vep`.
Use this script to convert the VEP output VCF back to TSV.
This script first creates a temporary output file using `bcftools +split-vep`, duplicating each entry if multiple entries exist for that variant (for example due to transcripts) and separating fields by tabs. An `echo` call then adds the header back to this temporary file and creates the final file.
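A minimal sketch of that approach (the selected columns, header and file names are placeholders, not the exact ones used by the bundled script):

```bash
# Sketch only: columns, header and file names are placeholders.
# -d duplicates the variant line for every VEP (e.g. transcript) entry, -A tab separates the CSQ subfields by tabs.
bcftools +split-vep -d -A tab -f '%CHROM\t%POS\t%REF\t%ALT\t%CSQ\n' input_vep.vcf.gz > output.tmp
# Prepend the header to the temporary file and create the final file.
echo -e "CHROM\tPOS\tREF\tALT\t<VEP subfield names>" | cat - output.tmp | gzip > output_vep.tsv.gz
```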
For usage details, use: `process_vep -h` or `python3 ./src/molgenis/capice_resources/process_vep -h`.
The module `threshold_calculator` uses the "validation" dataset (or benchmark dataset, containing "binarized labels") and a `capice predict` output file to find an optimal threshold according to the CAPICE publication (see Methods: Threshold selection strategies).
It outputs both a plot with an overview of the supplied variants and the optimal recall threshold, and a TSV containing all attempted thresholds with their performance metrics.
For usage details, use: `threshold-calculator -h` or `python3 ./src/molgenis/capice_resources/threshold_calculator -h`.
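As a hypothetical sketch (the flag names are assumptions, check `threshold-calculator -h` for the real options):

```bash
# Hypothetical example; flag names are assumptions, verify with threshold-calculator -h
threshold-calculator \
    -v /path/to/validation_with_binarized_labels.tsv.gz \
    -s /path/to/capice_predict_output.tsv.gz \
    -o /path/to/output_directory/
```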
The module `train_data_creator` is dedicated to creating the train-test and independent validation datasets used to create a new CAPICE model.
It uses the most recent ClinVar and VKGL releases as input to create these datasets.
For the validation dataset, only "high-quality" samples are randomly obtained from both files. All other samples are put into train-test.
For usage details, see: `train-data-creator -h` or `python3 ./src/molgenis/capice_resources/train_data_creator -h`.
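A hypothetical invocation is sketched below; the flag names are assumptions, so see `train-data-creator -h` for the real options:

```bash
# Hypothetical example; flag names are assumptions, verify with train-data-creator -h
train-data-creator \
    -v /path/to/vkgl_public_consensus_<date>.tsv \
    -c /path/to/clinvar_<date>.vcf.gz \
    -o /path/to/output_directory/
```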
Template sbatch script to run VEP on a Slurm cluster. Ensure it is up to date and that paths (and the `bind` argument within `args`) are adjusted accordingly before using!
The script `liftover_variants.sh` converts a VCF containing GRCh37 variants to GRCh38 variants. For this script, the user must ensure paths and variables are set correctly!
- Update the CAPICE tool:
  - Create a new branch for CAPICE.
  - Determine the feature to add and check whether a VEP processor should be written for it (VEP processors usually don't have to be written for int/float values).
  - Add the feature to `capice/resources/train_features.json` and to `test_main_train.py` `test_integration_reset_train_features`.
- Create a new branch for capice-resources.
- Update the VEP command in `capice-resources/src/utility_scripts/slurm_run_vep.sh`.
- Clone the capice-resources branch on a cluster.
- Create a working directory for the next step.
- Use `<capice-resources>/utility_scripts/create_poc.sh`:

      APPTAINER_BIND=/groups sbatch \
        --output=<workdir>/poc.log \
        --error=<workdir>/poc.err \
        --export=APPTAINER_BIND create_poc.sh \
        -a "<bind>" \
        -n "<capice branch>" \
        -b "<path_to/bcftools-<version>.sif>" \
        -w "<workdir>" \
        -r "<path_to/capice-resources/>" \
        -p "<path_to/vip/>"
- Check the output log for test failures.
- Commit the updated `slurm_run_vep.sh`.
- Update the VEP command in `capice/README.md` (Requirements & Usage -> VEP).
- Update the README regarding VEP plugins and the VEP command if needed.
- Commit the capice files.
- Obtain the latest GRCh38 VKGL release from the cluster
- Download the latest public GRCh38 ClinVar dataset (please note that these filenames are stored in the train-test and validation VCF, so including the file date in the filename improves reproducibility):

      wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_<date>.vcf.gz
- Use `<capice-resources>/utility_scripts/create_and_train.sh` to create the train-test and validation files (the workdir should already exist if the SLURM output and error logs are to be written there):

      mkdir <workdir>
      APPTAINER_BIND=<bind> sbatch \
        --job-name=capice_create_and_train --output=<workdir>/create_and_train.log \
        --error=<workdir>/create_and_train.err \
        --export=APPTAINER_BIND \
        <path_to/capice-resources/>/utility_scripts/create_and_train.sh \
        -p "<path_to/vip/>" \
        -b "<path_to/bcftools-<version>.sif>" \
        -w "<workdir>" \
        -c "<path_to/clinvar_<date>.vcf.gz>" \
        -v "<path_to/vkgl_public_consensus_hg38_<date>.tsv>" \
        -r "<path_to/capice-resources/>" \
        -m "<capice_production_model_filename>" \
        -t "<capice_production_git_tag>" -d "<capice production traintest data file name>"
- `<workdir>/train/train.sh` can be used to train a new model on the generated train_test data:

      sbatch <workdir>/train/train.sh <path_to/new_model_name.ubj>
- Check the created validation plots to see how the new model performs
- Check the explain / rank output
- Add validation plots to capice-resources/validation.
- Create a capice-resources pull-request.
- Create a capice pull-request.
- Create a capice-resources GitHub release draft and
  - add the `<workdir>/train/train_test.vcf.gz`, `<workdir>/train/validation.vcf.gz` and the new model.
- Create a capice release.
- Create a new Apptainer image:
  - Copy this def file.
  - Update the defined capice version & filename.
  - Run `sudo apptainer build sif/capice-<version>.sif def/capice-<version>.def`
Obtain the latest public GRCh38 VKGL (from the cluster) and ClinVar datasets and simply supply them to `train-data-creator`.
For further details, use `train-data-creator -h`.
VEP is used to add annotations to the train-test and validation files. The following command is a good representation of the VEP command that should be called:
vep --input_file <path to your input file> --format vcf --output_file <path to your output file> --vcf --compress_output gzip --sift s --polyphen s --numbers --symbol --shift_3prime 1 --allele_number --no_stats --offline --cache --dir_cache */path/to/latest/vep/cache* --species "homo_sapiens" --assembly *GRCh37 or GRCh38* --refseq --use_given_ref --exclude_predicted --use_given_ref --flag_pick_allele --force_overwrite --fork 4 --dont_skip --allow_non_variant --per_gene
(`--input_file`, `--output_file` and `--assembly` have to be manually altered)
For the most accurate VEP command, see the CAPICE README.md.
Command can vary based on the training features.
When comparing an XGBoost 0.72.1 model (CAPICE 2.0.0 or lower) to a new XGBoost 1.4.2 model, upload the validation.vcf.gz to CADD for annotation (set "INCLUDE ANNOTATIONS" to True).
Currently, there is no automated script in place to perform this check, since with CAPICE 4.0.0 the performance has been equalized.
First convert the VEP output VCF back to TSV using the conversion tool.
Making the converted VEP TSV train-ready requires the use of the `process-vep` module. This works for both GRCh37 and GRCh38 (GRCh38 only when the `-a` flag is supplied).
This script makes sure that variants with the correct labels are preserved (so that the labels remain accurate for the variants) and that duplicate entries between train-test and validation are removed.
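A hypothetical GRCh38 invocation is sketched below; apart from `-a`, which is mentioned above, the flag names are assumptions, so check `process_vep -h` for the real options:

```bash
# Hypothetical example; apart from -a (GRCh38), flag names are assumptions, verify with process_vep -h
process_vep \
    -t /path/to/train_test_vep.tsv.gz \
    -v /path/to/validation_vep.tsv.gz \
    -a \
    -o /path/to/output_directory/
```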
Before the training module of CAPICE can be called, make sure the train_features JSON used within training is up to date with your newly added and/or removed features.
Training a new model is rather easy, but takes quite a bit of time. It is recommended to start a slurm job on the GCC cluster and to let it run overnight.
The training module can be activated with the following command (assuming you have installed CAPICE the way it is supposed to be: `pip install .` or `pip install capice`):
capice -v train -i /path/to/processed/train_test.tsv.gz -e /path/to/up/to/date/features.json -o /path/to/output.ubj
Visualizing the XGBoost tree is a useful tool to see how each and every individual sample is scored. To use this functionality, one must load in the model. Also import matplotlib pyplot and plot_tree from xgboost. Then use matplotlib to make a very high resolution figure, use plot_tree to visualize the tree and then matplotlib again to export the figure. You will get some code that is similar to this:
from matplotlib import pyplot as plt
import xgboost as xgb
model = xgb.XGBClassifier()
model.load_model('/path/to/model.ubj')
fig, ax = plt.subplots(figsize=(160, 160)) # Note: (160,160) might still be too low of a resolution
# Note: num_trees=model.best_iteration is still provisional; it is unclear whether XGBoost always uses only the best iteration or all of them (there can easily be 300+ booster objects within an XGBClassifier).
xgb.plot_tree(model, num_trees=model.best_iteration, ax=ax) # If this says that the object is too large, adjust resolution
fig.savefig('/path/to/save_fig.png')
- I'm getting a `ValueError: invalid literal for int() with base 10: 'chr1'`, what's wrong?
You've likely downloaded a genome build 38 file, which has a different annotation for the chromosome. `train_data_creator` only supports build 37.
Standing on the shoulders of giants. This project could not have been possible without the existence of many other tools and resources. Among them, we would like to thank the people behind the following projects: