BPNet is a python package with a CLI to train and interpret base-resolution deep neural networks trained on functional genomics data such as ChIP-nexus or ChIP-seq. It addresses the problem of pinpointing the regulatory elements in the genome:
Specifically, it aims to answer the following questions:
- What are the sequence motifs?
- Where are they located in the genome?
- How do they interact?
For more information, see the BPNet manuscript:
Deep learning at base-resolution reveals motif syntax of the cis-regulatory code (http://dx.doi.org/10.1101/737981.)
Main documentation of the bpnet package and an end-to-end example higlighting the main features are contained in the following colab notebook https://colab.research.google.com/drive/1VNsNBfugPJfJ02LBgvPwj-gPK0L_djsD. You can run this notebook yourself by clicking on 'Open in playground'. Individual cells of this notebook can be executed by pressing the Shift+Enter keyboard shortcut.
To learn more about colab, visit https://colab.research.google.com and follow the 'Welcome To Colaboratory' notebook.
Compute data statistics to inform hyper-parameter selection such as choosing to trade off profile vs total count loss (lambda
hyper-parameter):
bpnet dataspec-stats dataspec.yml
Train a model on BigWig tracks specified in dataspec.yml using an existing architecture bpnet9 on 200 bp sequences with 6 dilated convolutional layers:
bpnet train --premade=bpnet9 dataspec.yml --override='seq_width=200;n_dil_layers=6' .
Compute contribution scores for regions specified in the dataspec.yml
file and store them into contrib.scores.h5
bpnet contrib . --method=deeplift contrib.scores.h5
Export BigWig tracks containing model predictions and contribution scores
bpnet export-bw . --regions=intervals.bed --scale-contribution bigwigs/
Discover motifs with TF-MoDISco using contribution scores stored in contrib.scores.h5
, premade configuration modisco-50k and restricting the number of seqlets per metacluster to 20k:
bpnet modisco-run contrib.scores.h5 --premade=modisco-50k --override='TfModiscoWorkflow.max_seqlets_per_metacluster=20000' modisco/
Determine motif instances with CWM scanning and store them to motif-instances.tsv.gz
bpnet cwm-scan modisco/ --contrib-file=contrib.scores.h5 modisco/motif-instances.tsv.gz
Generate additional reports suitable for ChIP-nexus or ChIP-seq data:
bpnet chip-nexus-analysis modisco/
Note: these commands are also accessible as python functions:
bpnet.cli.train.bpnet_train
bpnet.cli.train.dataspec_stats
bpnet.cli.contrib.bpnet_contrib
bpnet.cli.export_bw.bpnet_export_bw
bpnet.cli.modisco.bpnet_modisco_run
bpnet.cli.modisco.cwm_scan
bpnet.cli.modisco.chip_nexus_analysis
bpnet.seqmodel.SeqModel
- Keras model container specified by implementing output 'heads' and a common 'body'. It contains methods to compute the contribution scores of the input sequence w.r.t. differnet output heads.bpnet.BPNet.BPNetSeqModel
- Wrapper aroundSeqModel
consolidating profile and total count predictions into a single output per task. It provides methods to export predictions and contribution scores to BigWig files as well as methods to simulate the spacing between two motifs.bpnet.cli.contrib.ContribFile
- File handle to the HDF5 containing the contribution scoresbpnet.modisco.files.ModiscoFile
- File handle to the HDF5 file produced by TF-MoDISco.bpnet.modisco.core.Pattern
- Object containing the PFM, CWM and optionally the signal footprintbpnet.modisco.core.Seqlet
- Object containing the seqlet coordinates.bpnet.modisco.core.StackedSeqletContrib
- Object containing the sequence, contribution scores and raw data at seqlet locations.
bpnet.dataspecs.DataSpec
- File handle to thedataspec.yml
filedfi
- Frequently used alias for a pandasDataFrame
containing motif instance coordinates produced bybpnet cwm-scan
. See the colab notebook for the column description.
Supported python version is 3.6. After installing anaconda (download page) or miniconda (download page), create a new bpnet environment by executing the following code:
# Clone this repository
git clone git@github.com:kundajelab/bpnet.git
cd bpnet
# create 'bpnet' conda environment
conda env create -f conda-env.yml
# Disable HDF5 file locking to prevent issues with Keras (https://github.com/h5py/h5py/issues/1082)
echo 'export HDF5_USE_FILE_LOCKING=FALSE' >> ~/.bashrc
# Activate the conda environment
source activate bpnet
Alternatively, you could also start a fresh conda environment by running the following
conda env create -n bpnet python=3.6
source activate bpnet
conda install -c bioconda pybedtools bedtools pybigwig pysam genomelake
pip install git+https://github.com/kundajelab/DeepExplain.git
pip install tensorflow~=1.0 # or tensorflow-gpu if you are using a GPU
pip install bpnet
echo 'export HDF5_USE_FILE_LOCKING=FALSE' >> ~/.bashrc
When using bpnet from the command line, don't forget to activate the bpnet
conda environment before:
# activate the bpnet conda environment
source activate bpnet
# run bpnet
bpnet <command> ...
To use the --vmtouch
in bpnet train
command and thereby speed-up data-loading, install vmtouch. vmtouch is used to load the bigWig files into system memory cache which allows multiple processes to access
the bigWigs loaded into memory.
Here's how to build and install vmtouch:
# ~/bin = directory for localy compiled binaries
mkdir -p ~/bin
cd ~/bin
# Clone and build
git clone https://github.com/hoytech/vmtouch.git vmtouch_src
cd vmtouch_src
make
# Move the binary to ~/bin
cp vmtouch ../
# Add ~/bin to $PATH
echo 'export PATH=$PATH:~/bin' >> ~/.bashrc