Merge pull request #47 from bmvdgeijn/development
Development
gmcvicker authored Sep 4, 2016
2 parents c2509a4 + 5a1fbd5 commit 7b6f6bf
Showing 194 changed files with 8,403 additions and 2,309 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -2,9 +2,15 @@
__pycache__/
*.py[cod]

# emacs tmp files
*~

# C extensions
*.so

# snakemake files
.snakemake

# Distribution / packaging
.Python
env/
31 changes: 31 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,31 @@
Version 0.2 - September 3, 2016
-----------

Version 0.2 of WASP is a major update to the code,
especially the mapping code. It fixes several bugs related
to how paired-end reads are handled. For this reason it is
strongly recommended that users switch to this version
of the pipeline.

Changes include:
* re-wrote mapping scripts to make them simpler and more modular
* re-wrote mapping test scripts and added many tests
* fixed several mapping pipeline bugs related to paired-end reads
* find_intersecting_snps.py no longer requires a window size (the window
  is now unlimited)
* find_intersecting_snps.py can now take HDF5 files as input
* find_intersecting_snps.py can now consider only haplotypes
present in samples, rather than all possible allelic combinations
of SNPs overlapping reads.
* added get_as_counts.py script that outputs allele-specific read
counts at all polymorphic SNPs.
* snp2h5 now records sample info in output HDF5 files
* improved speed of many CHT pipeline steps
* improved stability of CHT dispersion parameter estimation
* added Snakemake workflows for both mapping and CHT pipelines
* added qqplot.R script to CHT workflow


Version 0.1
-----------
Initial version of WASP
6 changes: 6 additions & 0 deletions CHT/.gitignore
@@ -1,5 +1,11 @@
*.py[cod]

# snakemake files
.snakemake

# emacs backups
*~

# C extensions
*.so

10 changes: 7 additions & 3 deletions CHT/README.md
@@ -93,11 +93,15 @@ the [1000 Genomes website](http://www.1000genomes.org/data#DataAccess).

## Workflow

An example workflow is provided in [example_workflow.sh](../example_workflow.sh)
script. This workflow uses data in the [example_data directory](../example_data).
We now provide a Snakemake workflow that can be used to run the entire
CHT pipeline. For more information, see the [Snakemake README](README.snakemake.md).

An example workflow in the form of a shell script is also provided in
the [example_workflow.sh](../example_workflow.sh) script. This workflow uses
data in the [example_data directory](../example_data).

Some of the input files that we used for our paper can be downloaded from
[here](http://eqtl.uchicago.edu/histone_mods/haplotype_read_counts/).
[here](http://eqtl.uchicago.edu/histone_mods/).

The following steps can be used to generate input files and run the
Combined Haplotype Test. The examples given below use the example
91 changes: 91 additions & 0 deletions CHT/README.snakemake.md
@@ -0,0 +1,91 @@
## Snakemake CHT pipeline

[Snakemake](https://bitbucket.org/snakemake/snakemake/wiki/Home) is a
workflow management system designed to streamline the execution of
software pipelines. We now provide a Snakemake rule file that can be
used to run the entire Combined Haplotype Test (CHT) pipeline.

For a more complete description of Snakemake see the
[Snakemake tutorial](http://snakemake.bitbucket.org/snakemake-tutorial.html).

## Installing Snakemake

Snakemake requires python3; however, the CHT pipeline requires
python2. For this reason, if you are using
[Anaconda](https://www.continuum.io/downloads), it is recommended that
you create a [python3
environment](http://conda.pydata.org/docs/py2or3.html#create-a-python-3-5-environment).
For example, you can create a python3.5 Anaconda environment with the
following shell command (this only needs to be done once):

conda create -n py35 python=3.5 anaconda

You can then activate the py35 environment, and install the latest version of
Snakemake with the following commands:

source activate py35
conda install snakemake

Then, when you want to switch back to your default (e.g. python2)
environment, do the following:

source deactivate


## Configuring the CHT pipeline

The rules for the Snakemake tasks are defined in the [Snakefile](Snakefile).

Configuration parameters for this Snakefile are read from the YAML file
[snake_conf.yaml](snake_conf.yaml).

Before running Snakemake, edit this file to specify the locations
of all of the input directories and files that will be used by the pipeline.
This includes the locations of the impute2 SNP files, input BAM files, etc.

Importantly, you must set `wasp_dir` to point to the location of WASP
on your system, and set `py2` and `Rscript` to set up the environment
for python and R (e.g. by modifying your PATH) and call the
appropriate interpreter. This is necessary because Snakemake is run
using python3, but most of the scripts require python2.
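
As an illustrative sketch only (the key names `wasp_dir`, `py2`, and
`Rscript` come from the description above, but the values shown, and any
other keys in your file, are assumptions rather than shipped defaults),
the relevant entries in `snake_conf.yaml` might look something like this:

    # location of the WASP checkout on your system
    wasp_dir: /home/myuser/WASP

    # commands that set up the environment and call the interpreters;
    # py2 must resolve to a python2 interpreter even though Snakemake
    # itself runs under python3
    py2: "source activate py27; python"
    Rscript: "Rscript"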


## Running the CHT pipeline

Snakemake can be run as a single process or on a compute cluster with
multiple jobs running simultaneously. To run Snakemake on a single node,
you could do something like the following:

source activate py35
cd $WASP_DIR/CHT
snakemake
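
If you want Snakemake to run several rules in parallel on that single
node, you can also pass a job count using Snakemake's standard `--jobs`
option, for example:

    # run up to 4 rules concurrently on this node
    snakemake --jobs 4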

We provide a script [run_snakemake.sh](run_snakemake.sh) to run Snakemake
on an SGE compute cluster. You must be in a python3 environment to run this
script, and the script must be run from a job submission host.

source activate py35
cd $WASP_DIR/CHT
./run_snakemake.sh

It should be possible to make simple modifications to this script to
run on queue management systems other than SGE (e.g. LSF or Slurm).
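
As a rough sketch of the kind of change involved (this is not the
contents of run_snakemake.sh, and the resource flags are placeholders),
Snakemake's generic `--cluster` option can be pointed at a Slurm
submission command instead of `qsub`:

    source activate py35
    cd $WASP_DIR/CHT
    # wrap each Snakemake job in an sbatch submission; allow up to
    # 100 jobs to be queued or running at once
    snakemake --cluster "sbatch --mem=8G --cpus-per-task=1" --jobs 100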


You should run Snakemake from within a [Screen](https://www.gnu.org/software/screen/) virtual terminal or using [nohup](https://en.wikipedia.org/wiki/Nohup) so
that if you are disconnected from the cluster, Snakemake will continue to run.
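
For example, with nohup you could start the pipeline along these lines
(the log file name here is just an illustration):

    cd $WASP_DIR/CHT
    # keep Snakemake running after you log out; capture output in a log
    nohup ./run_snakemake.sh > snakemake.log 2>&1 &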

At the conclusion of the pipeline, a QQ plot will be generated that summarizes
the results of the CHT.


## Debugging the CHT pipeline

By default, Snakemake will write an output and error file for each job
to your home directory. These files will be named like
`snakejob.<rulename>.<job_num>.sh.{e|o}<sge_jobid>`. For example:

# contains error output for extract_haplotype_read_counts rule:
snakejob.extract_haplotype_read_counts.13.sh.e4507125

If a rule fails, you should check the appropriate output file to see what
error occurred. A major benefit of Snakemake is that if you re-run Snakemake
after a job fails, it will pick up where it left off.
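
For example, one way to scan the error files for failed jobs (assuming
the default naming shown above and that the files were written to your
home directory) is:

    # list error files that mention an error or a Python traceback
    grep -liE "error|traceback" ~/snakejob.*.sh.e*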
