Skip to content

Commit

Permalink
Copied (not forked) from github.com/jeltje/varscan2. This is now the …
Browse files Browse the repository at this point in the history
…main github repository for this tool
  • Loading branch information
Jeltje committed Feb 28, 2017
1 parent 48a60be commit 64580e2
Show file tree
Hide file tree
Showing 10 changed files with 845 additions and 2 deletions.
Binary file added DNAcopy_1.48.0.tar.gz
Binary file not shown.
52 changes: 52 additions & 0 deletions Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
FROM ubuntu:14.04

MAINTAINER Jeltje van Baren, jeltje.van.baren@gmail.com

# create a working directory and work from there
RUN mkdir /tmp/install
WORKDIR /tmp/install

RUN apt-get update && apt-get install -y \
gcc \
make \
zlib1g-dev \
git \
wget \
python-numpy \
default-jre \
r-base \
bc

# DNAcopy version keeps changing so deprecated this:
# R and DNAcopy package (move to R library location)
#RUN apt-get install -y r-base
#RUN wget http://www.bioconductor.org/packages/release/bioc/src/contrib/DNAcopy_1.40.0.tar.gz
# instead:
COPY ./DNAcopy_1.48.0.tar.gz ./
RUN R CMD INSTALL DNAcopy_1.48.0.tar.gz

# Samtools 0.1.18 - note: 0.1.19 and 1.1 do NOT work, VarScan copynumber dies on the mpileup
RUN wget http://downloads.sourceforge.net/project/samtools/samtools/0.1.18/samtools-0.1.18.tar.bz2
RUN tar -xvf samtools-0.1.18.tar.bz2
# the make command generates a lot of warnings, none of them relevant to the final samtools code, hence 2>/dev/null
RUN (cd samtools-0.1.18/ && make DFLAGS='-D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE -D_CURSES_LIB=0' LIBCURSES='' 2>/dev/null && mv samtools /usr/local/bin)

# get varscan
RUN wget -O /usr/local/bin/VarScan.jar https://github.com/dkoboldt/varscan/releases/download/2.4.2/VarScan.v2.4.2.jar

# Move wrapper and helper scripts to same location
ADD ./run_varscan /usr/local/bin/
ADD ./separateArms.py /usr/local/bin/
ADD ./basicDNAcopy.R /usr/local/bin/
ADD ./meanLogRatioByChromosome.py /usr/local/bin/

# Set WORKDIR to /data -- predefined mount location.
RUN mkdir /data
WORKDIR /data

# And clean up
RUN apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/install

ENTRYPOINT ["bash", "/usr/local/bin/run_varscan"]


68 changes: 68 additions & 0 deletions Dockstore.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

id: Varscan2
label: Varscan2 workflow

baseCommand: ['-s', './']

description: |
A Docker container for a Varscan2 workflow. See the [github repo](https://github.com/Jeltje/varscan2) for more information.

requirements:
- class: DockerRequirement
dockerPull: quay.io/ucsc_cgl/varscan_cnv

inputs:
genome:
type: File
doc: Genome fasta
format: http://edamontology.org/format_1929
inputBinding:
prefix: -i

centromeres:
type: File
doc: Centromere bed file
format: http://edamontology.org/format_3003
inputBinding:
prefix: -b

targets:
type: File
doc: Exome Targets bed file
format: http://edamontology.org/format_3003
inputBinding:
prefix: -w

control_bam_input:
type: File
doc: The control exome BAM file used as input, it must be sorted.
format: http://edamontology.org/format_2572
inputBinding:
prefix: -c

tumor_bam_input:
type: File
doc: The tumor exome BAM file used as input, it must be sorted.
format: http://edamontology.org/format_2572
inputBinding:
prefix: -t

sample_id:
type: string?
default: mypatient
doc: sample ID to use in output
inputBinding:
prefix: -q


stdout: output.cnv

outputs:
- id: output
type: stdout


26 changes: 26 additions & 0 deletions Dockstore.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
{
"output": {
"path": "/tmp/varscan.cnv",
"class": "File"
},
"genome": {
"path": "http://hgwdev.cse.ucsc.edu/~jeltje/public_data/genome.fa.gz",
"class": "File"
},
"control_bam_input": {
"path":"https://dcc.icgc.org/api/v1/download?fn=/PCAWG/reference_data/data_for_testing/HCC1143_ds/HCC1143_BL.bam",
"class": "File"
},
"centromeres": {
"path": "http://hgwdev.cse.ucsc.edu/~jeltje/public_data/centromeres.bed",
"class": "File"
},
"tumor_bam_input": {
"path":"https://dcc.icgc.org/api/v1/download?fn=/PCAWG/reference_data/data_for_testing/HCC1143_ds/HCC1143_BL.bam",
"class": "File"
},
"targets": {
"path": "http://hgwdev.cse.ucsc.edu/~jeltje/public_data/targets.bed",
"class": "File"
}
}
125 changes: 123 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,2 +1,123 @@
# dockstore_tool_varscan_cnv
Varscan copynumber variation caller workflow
# Varscan2

**This repository contains code to create a docker implementation of the Varscan2.4.2 copynumber variation (CNV) caller.**

Varscan2 was developed by Dan Koboldt (see References below). It can be used to detect copy number variation (CNV) in sample pairs, usually exomes from a tumor and control from one patient.

The Varscan2 executable (https://github.com/dkoboldt/varscan.git) combines several tools. It is meant to be run in a pipeline, during which different tools are called in sequence. For details on Varscan, see http://dkoboldt.github.io/varscan/


This repository ONLY contains a pipeline for Varscan2 **copynumber variation**. If you want to run other Varscan tools, please use Varscan2 directly. This docker container contains a wrapper script that uses Varscan tools and other programs *with specific parameters*. These may not be the perfect parameters for your particular samples. See below for the full list of pipeline steps.

**Inputs** to the program are a tumor/control pair of BAM files, a genome fasta file and several [bed format](https://genome.ucsc.edu/FAQ/FAQformat#format1) helper files (see below). The program will sort the bam files by coordinate if it finds this hasn't been done.
**Output** is a file with chromosome segments that are scored for amplification or deletion.

To get per-gene output, these scores must be mapped to an annotation, for example using [this program] (https://github.com/Jeltje/cnvtogenes)

## The code

The Varscan wrapper script runs the following:

1. samtools flagstat on each bam file
2. samtools mpileup on both bam files
3. Determine unique mapped read ratio
4. Varscan copynumber
5. Remove low coverage regions
6. Varscan copyCaller
7. Calculate median for recentering
8. Varscan copyCaller recenter
9. Separate chromosome arms
10. DNAcopy (CBS)
11. Merge chromosome arms

The chromosome arms are separated before the Circular Binary Segmentation (CBS) step to avoid making calls across centromeres.

## Getting the docker container

The latest Varscan docker image can be downloaded directly from quay.io using
`docker pull quay.io/jeltje/varscan2`

Alternatively, you can build from the github repo:
```
git clone https://github.com/jeltje/varscan2.git
cd varscan2
docker build -t jeltje/varscan2 .
```

## Running the docker container

For details on running docker containers in general, see the excellent tutorial at https://docs.docker.com/

To see a usage statement, run

```
docker run jeltje/varscan2 -h
```

### Example input:

```
docker run -v /path/to/input/files:/data jeltje/varscan2 -c normal.bam -t tumor.bam -q sampleid -i genome.fa -b centromeres.bed -w targets.bed -s tmpdir > varscan.cnv
```

where

`normal.bam` and `tumor.bam` are BAM format files of exome reads aligned to the genome.

`sampleid` is an identifier for the patient. This will be used in the output.

`genome.fa` is a fasta file containing the genome that was used to create the BAM files. This file can be gzipped (and must end with `.gz` if so). A samtools indexed `.fai` will be created if it is not present in the same directory as this file (for details see Other Considerations, below).

`centromeres.bed` is a [bed format file](https://genome.ucsc.edu/FAQ/FAQformat#format1) containing centromere locations. This list is used to remove centromeres from the CBS calls.

`targets.bed` is a list of exome targets in bed format. This is used as a 'whitelist' of genome regions so that off target alignments will not be used for analysis

`tmpdir` is a directory for temporary output files. If you set option -d, these files will be kept

Keep in mind that all these file locations must be with respect to your `/path/to/input/files`.

Centromeres for hg19 are provided ind the `/data` directory

> You can find centromere locations for genomes via
> http://genome.ucsc.edu/cgi-bin/hgTables
> Using the following selections:
> - group: Mapping and Sequencing
> - track:gap
> - filter - goes to new page, look for 'type does match' and type centromere, submit
> - output format: bed
> Submit, on the next page just press Get Bed
*SPECIAL NOTE* In newer genome versions there is a separate `centromere` track, but it may show multiple overlapping centromeres for the same chromosome. These must be reduced to a single centromere per chromosome, or the program will not work correctly.

## Output

Output is written to `STDOUT` and uses the following format:
```
sampleID chrom loc.start loc.end num.mark seg.mean
```

To get amplified or deleted segments from this file, a threshold must be applied. This is often set to `0.25/-0.25`,
and with a minimum number of 10 markers per segment.


## Other considerations

Tumor and control really must be from the same patient and processed in the same experiment. Batch effects are strong in exome experiments and using the wrong control renders Varscan output meaningless.

The genome file will be unzipped and/or indexed using `samtools faidx` if necessary.
To save time and index a genome, run
```
samtools faidx <genome.fa>
```
This creates an index named genome.fa.fai
The genome and the index must be in the same directory, and the genome file (not the index) is the input to run_varscan

The whitelist is a bed format file with the exome targets used in the experiment. It ensures that Varscan only uses target regions for its analysis and not any off target read matches. It is important to use the real list of exome targets. For meaningful results do not use a generic list.


## References

Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK.
VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.
Genome Res. 2012 Mar;22(3):568-76. doi: 10.1101/gr.129684.111.
24 changes: 24 additions & 0 deletions basicDNAcopy.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
library(DNAcopy)

args <- commandArgs(TRUE)

# arguments are input and SD

# Alway use the same random seed for reproducible results
set.seed(0xcafe) # cafe is a hex number

cn <- read.table(args[1],header=TRUE)
sd <- as.double(args[2])
CNA.object <-CNA(genomdat = cn$adjusted_log_ratio, chrom = cn$chrom, maploc = cn$chr_stop,
data.type = 'logratio', sampleid = "sample")

smoothed.CNA.object <- smooth.CNA(CNA.object)

segment.smoothed.CNA.object <- segment(smoothed.CNA.object, undo.splits="sdundo", undo.SD=sd, verbose=1)
p.segment.smoothed.CNA.object <- segments.p(segment.smoothed.CNA.object)

outfile <- paste(args[1], "SD", sd, "dnacopy.out", sep=".")

write.table(p.segment.smoothed.CNA.object[,1:6], file=outfile, quote=F, row.names=F, sep="\t")

detach(package:DNAcopy)
64 changes: 64 additions & 0 deletions meanLogRatioByChromosome.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
#!/usr/bin/python

import sys, os, re, argparse
from numpy import *

parser = argparse.ArgumentParser(description="Calculates log average per chromosome")
parser.add_argument('cpcalled', type=str,help="copyCalled output")

class Chrom(object):
"""Holds fragment scores by chromosomes """
def __init__(self, chrom, score, endpos):
self.frags = []
self.chrom = chrom
self.add(chrom, score, endpos)
def add(self, chrom, score, endpos):
if self.chrom == chrom:
self.frags.append(score)
self.lastfrag = endpos
return True
else:
return False
def stats(self):
self.mean = mean(self.frags)
self.std = std(self.frags)


if len(sys.argv)==1:
parser.print_help()
sys.exit(1)
args = parser.parse_args()

# Main
chromTable = [] # holds chrom objects
curChrom = Chrom('empty', 0, 0)
f = open(args.cpcalled,'r')
for line in f:
line = line.strip()
fields = line.split("\t")
chr = fields[0]
if chr == "chrom":
continue
score = float(fields[6])
endpos = int(fields[2])
if not (curChrom.add(chr, score, endpos)):
curChrom = Chrom(chr, score, endpos)
chromTable.append(curChrom)
f.close()

if len(chromTable) < 3:
print >>sys.stderr, "ERROR: Please enter whole genome file"
sys.exit(1)

means = []
for chr in chromTable:
if chr.lastfrag < 60000000: # skip MT, GL, Y
# if len(chr.frags) < 3000: # skip MT, GL, Y
continue
chr.stats()
means.append(chr.mean)

med = len(means)/2 # this rounds, which is perfect
means.sort()
print "%.2f" % mean([means[med-1], means[med], means[med+1]])

Loading

0 comments on commit 64580e2

Please sign in to comment.