speclet - A Bayesian hierarchical model to discover tissue-specific cancer driver genes and synthetic lethal interactions from CRISPR/Cas9 LoF screens
The speclet model accounts for cell line- and chromosome-specific differences while simultaneously measuring the effect of targeting each gene across multiple molecular covariates including copy number, mRNA expression, and mutation status. The effect of the presence of mutations to key driver and tumor suppressor genes is also included to identify putative synthetic lethal interactions. The results of this project have been published in Chapter 4 of my Ph.D. dissertation available here: "Studying the tissue-specificity of cancer driver genes through KRAS and genetic dependency screens" (link to come soon).
Many setup and running commands have been added as `make` commands. Run `make help` to see the options available.
There are two 'conda' environments for this project: the first, `speclet`, for modeling and analysis, and the second, `speclet_smk`, for the pipelines.
They can be created using the following commands.
Here, we use 'mamba' as a drop-in replacement for 'conda' to speed up the installation process.
conda install -n base -c conda-forge mamba
mamba env create -f conda.yaml
mamba env create -f conda_smk.yaml
Either environment can then be used like a normal 'conda' environment.
For example, below is the command to activate the `speclet` environment.
conda activate speclet
Alternatively, the environments can be created with the `make pyenvs` command.
# Same as above.
make pyenvs
On O2, because I don't have control over the `base` conda environment, I follow the incantations below for each environment:
conda create -n speclet --yes -c conda-forge python=3.9 mamba
conda activate speclet && mamba env update --name speclet --file conda.yaml
In addition to that fun, the version of conda installed on O2 has a problem installing Python 3.10, so I instead install 3.9 and let the `mamba env update` step upgrade it.
Some additions to the environment need to be made in order to use a GPU for sampling from posterior distributions with the JAX backend in PyMC.
There are instructions provided on the JAX GitHub repo and the PyMC repo.
First, the `cuda` and `cudnn` libraries need to be installed.
Second, a specific distribution of `jax` should be installed.
At the time of writing, the following commands work, but I would recommend consulting the two links above if doing this again in the future.
mamba install --yes -c nvidia "cuda>=11.1" "cudnn>=8.2"
pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
These commands have been added to the Makefile under the command `make gpu`.
Use the same commands with the `speclet_smk` environment active to be able to use the GPU in the pipelines.
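After the install, it can be worth checking that JAX actually sees the GPU before launching a long sampling run. The snippet below is a minimal sketch and not part of the project's Makefile: it asks JAX for its default backend with `jax.default_backend()` and reports `unavailable` if `python` or `jax` cannot be found. The variable name `jax_backend` is only illustrative.

```shell
# Ask JAX which backend it selected (assumes the target conda environment is active).
if python -c "import jax" 2>/dev/null; then
  jax_backend="$(python -c 'import jax; print(jax.default_backend())')"
else
  # Either python is not on PATH or jax is not installed in this environment.
  jax_backend="unavailable"
fi
echo "JAX backend: ${jax_backend}"
```

A value of `cpu` after running the GPU setup suggests the CUDA-enabled build of `jax` was not picked up, in which case the installation instructions from JAX and PyMC are the place to look.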
The 'renv' package is used to manage the R packages. R is only used for data processing in this project. The environment can be set up in multiple ways. The first is by entering R and following the prompts to install the necessary packages. Another option is to install 'renv' and run its restore command, as shown below in the R console.
install.packages("renv")
renv::restore()
This can also be accomplished with the following `make` command.
make renv
Installation of the Python virtual environment can be confirmed by running the 'speclet' test suite.
conda activate speclet
pytest
# Alternatively
make test # or make test_o2 if on O2 HPC
If you plan to work on the code in this project, I recommend installing 'pre-commit' so that all git commits are first checked for various style and code features.
The package is included in the `speclet` virtual environment so you just need to run the following command once.
pre-commit install
There are options for configuration in the "project-config.yaml" file. There are controls for various constants and parameters for analyses and pipelines. Most are intuitively named.
There is a required ".env" file that should be configured as follows.
PROJECT_ROOT=${PWD} # location of the root directory
PROJECT_CONFIG=${PROJECT_ROOT}/project-config.yaml # location of project config file
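Because the file uses shell-style `${...}` references, one quick way to see how the two values resolve is to source it in a shell. The sketch below is an illustration only: it writes the same two lines into a throwaway directory (the name `demo_dir` is made up here) and exports them with `set -a`. How 'speclet' itself reads the file is not shown here.

```shell
# Work in a temporary directory so a real checkout is left untouched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Recreate the ".env" file exactly as described above.
cat > .env <<'EOF'
PROJECT_ROOT=${PWD} # location of the root directory
PROJECT_CONFIG=${PROJECT_ROOT}/project-config.yaml # location of project config file
EOF

# Export every variable assigned while sourcing the file.
set -a
. ./.env
set +a

echo "PROJECT_ROOT=${PROJECT_ROOT}"
echo "PROJECT_CONFIG=${PROJECT_CONFIG}"
```

In a real checkout, running only the `set -a` / `. ./.env` / `set +a` lines from the repository root should leave `PROJECT_ROOT` pointing at that root.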
An optional environment variable used by 'speclet' is `AESARA_GCC_FLAG`, which sets any desired Aesara gcc/g++ flags in the pipelines.
I need to have it set so that Aesara uses the correct gcc and BLAS modules when running in pipelines on O2 (see issue #151 for details).
The data is downloaded to the "data/" directory and prepared in the "munge/" directory. The prepared data is available in "modeling_data/". Please see the READMEs in the respective directories for more information.
All of the data can be downloaded and prepared using the following commands.
make download_data
make munge # or `make munge_o2` if on O2 HPC
Exploration and analyses are conducted in the "notebooks/" directory. Subdirectories divide related notebooks. See the README in that directory for further details.
All shared Python code is contained in the "speclet/" directory. The installation of this directory as an editable module should be done automatically when the conda environment is created. If this failed, the module can be installed using the following command.
# Run only if the module was not automatically installed by conda.
pip install -e .
The modules are tested using 'pytest' – see below for how to run the tests. They also conform to the 'black' and 'isort' formatters and make heavy use of Python's type-hinting system checked by 'mypy'. The functions are well documented using the Google documentation style and are checked by 'pydocstyle'.
All pipelines and associated files (e.g. configurations and runners) are in the "pipelines/" directory.
Each pipeline contains an associated `bash` script and `make` command that can be used to run the pipeline (usually on O2).
See the README in the "pipelines/" directory for more information.
Standardized reports are available in the "reports/" directory. Each analysis pipeline has a corresponding subdirectory in the reports directory. These notebooks are meant as quick, standardized reports to check on the results of a pipeline. More detailed analyses are in the "notebooks/" section.
Presentations that involved this project are stored in the "presentations/" directory. More information is available in the README in that directory.
Tests in the "tests/" directory have been written against the modules in "speclet/" using 'pytest' and 'hypothesis'. They can be run using the following command.
# Run full test suite.
pytest
# Or run the tests in two groups simultaneously.
make test # `test_o2` on O2 HPC
The coverage report can be shown by adding the `--cov="speclet"` flag.
Some tests are slow because they involve the creation of models or sampling/fitting them.
These can be skipped using the `-m "not slow"` flag.
Some tests require the ability to construct plots (using the 'matplotlib' library), but not all platforms (notably the HMS research computing cluster) provide this ability.
These tests can be skipped using the `-m "not plots"` flag.
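The marker flags can be combined, since the `-m` option of 'pytest' accepts boolean expressions such as `not slow and not plots`. The helper below is a hypothetical convenience, not something provided by the repository: `skip_markers` builds that expression from a list of marker names, and the final line prints the resulting command instead of running it so the sketch works outside the project.

```shell
# Build a pytest marker expression that deselects every marker passed in.
skip_markers() {
  out=""
  for marker in "$@"; do
    if [ -z "$out" ]; then
      out="not ${marker}"
    else
      out="${out} and not ${marker}"
    fi
  done
  printf '%s' "$out"
}

marker_expr="$(skip_markers slow plots)"

# Show the full command, including the coverage flag described above.
echo "pytest -m \"${marker_expr}\" --cov=\"speclet\""
```

Running the printed command from the repository root with the `speclet` environment active would skip both the slow tests and the plotting tests while still reporting coverage.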
These tests are automatically run on GitHub Actions on pushes to, or PRs against, the `master` git branch.
The most recent results can be seen here.
Each individual pipeline can be run through a `bash` script or a `make` command.
See the pipelines README for full details.
The notebooks contain the analyses of the models and additional exploration of the data and other model designs. See the "notebooks/" directory for information on running these analyses.
The entire project can be installed from scratch and all analyses run with the following `make` command.
make build # or `build_o2` on the O2 HPC