Releases: ENCODE-DCC/chip-seq-pipeline2

v1.5.0

24 Jun 22:51
6655a2d

Upgraded WDL to 1.0

  • Added metadata to WDL
    • Removed hacky comments for Caper.
    • meta for general pipeline metadata (e.g. version, docker image)
    • parameter_meta for input parameters.

Pooling control

  • chip.always_use_pooled_control is now true by default, which means that the pipeline always tries to pool controls if multiple control replicates are defined. Such a pooled control is then used for calling peaks on each experiment replicate.
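
If this behavior is not desired, the flag can be switched off in the input JSON; a minimal sketch:

    {
        "chip.always_use_pooled_control": false
    }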

Added control mode

  • Added control to chip.pipeline_type.
  • Now chip.pipeline_type has three choices tf, histone and control.
  • For control mode, do not use inputs prefixed with ctl_*. Instead, define inputs in the non-ctl variables, e.g. define control FASTQs in chip.fastqs_rep1_R1 (not in chip.ctl_fastqs_rep1_R1).
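
For example, a minimal control-mode input JSON might look like the following (the FASTQ path is hypothetical, and other required inputs such as chip.genome_tsv are omitted):

    {
        "chip.pipeline_type": "control",
        "chip.fastqs_rep1_R1": ["ctl_rep1.fastq.gz"]
    }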

Bug fixes

  • Clip peaks' genome coordinates between 0 and chromSize.
    • Affected files: SPP/MACS2 peak and IDR/IDR_unthresholded/overlap peak.

Updated reference genome data

  • hg38: v1 -> v3
  • mm10: v1 -> v3

No update for old genome data: mm9, hg19. They are still at v1.

Reference genome dataset v3: ENCODE4 standard for ATAC/ChIP.

  • New TSS regions (tss) based on GENCODE annotation.
    • hg38: GENCODE v29
    • mm10: GENCODE vM21
  • Repacked other annotation BED files: no changes in actual contents.
    • Enhancer (enh)
    • Promoter (prom)
    • DHS regions (dnase)
  • New blacklist (blacklist) for hg38. The old blacklist is kept for mm10.

v1.4.0.1

11 Apr 05:19
e2a698d
Compare
Choose a tag to compare

IMPORTANT: Update Caper to >= 0.8.
$ pip install caper --upgrade

IMPORTANT: Conda users must update pipeline's Conda env.
$ bash scripts/update_conda_env.sh

New control subsampling

  • Controlled by chip.ctl_depth_limit and chip.exp_ctl_depth_ratio_limit. Two limits are calculated from these parameters and the pipeline takes their maximum: max(ctl_depth_limit, exp_ctl_depth_ratio_limit * exp_rep_read_depth). If a control is deeper than this limit, it is subsampled down to it (see the example after this list).
  • chip.ctl_depth_limit: Hard limit on control's read depth. 200M by default.
  • chip.exp_ctl_depth_ratio_limit: Factor to be multiplied to experiment replicate's read depth. 5.0 by default.
  • We still keep control subsampling controlled by the parameter chip.ctl_subsample_reads.
    • Both raw and filtered control BAMs keep full reads. The filtered (nodup) control BAM is converted into a control TAG-ALIGN, which is then subsampled down to chip.ctl_subsample_reads (if it is defined > 0). This parameter modifies the TAG-ALIGN itself, so it affects all downstream analyses such as peak calling, and also the new automatic control subsampling, which is done in the task call_peak.
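
For example, both limits can be tightened in the input JSON (the values below are illustrative, not defaults):

    {
        "chip.ctl_depth_limit": 100000000,
        "chip.exp_ctl_depth_ratio_limit": 3.0
    }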

Cropping FASTQs: Added a parameter chip.crop_length_tol, which defines a tolerance to allow reads slightly shorter than crop_length. It is 2 by default and only takes effect when chip.crop_length is defined (> 0). Trimmomatic's parameters CROP and MINLEN will be chip.crop_length and chip.crop_length - abs(chip.crop_length_tol), respectively. The output (cropped FASTQ) filename will be PREFIX.crop_${CROP}-${TOLERANCE}bp.fastq.gz, where TOLERANCE = CROP - MINLEN. See the example after the following list.

  • All reads longer (>) than chip.crop_length will be cropped.
  • All reads shorter (<) than chip.crop_length - abs(chip.crop_length_tol) will be removed.
  • All reads not shorter (>=) than chip.crop_length - abs(chip.crop_length_tol) and not longer (<=) than chip.crop_length will be kept.
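
For example, with the following input JSON, CROP = 50 and MINLEN = 48: reads longer than 50 bp are cropped to 50 bp, reads shorter than 48 bp are removed, and the cropped FASTQ is named PREFIX.crop_50-2bp.fastq.gz:

    {
        "chip.crop_length": 50,
        "chip.crop_length_tol": 2
    }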

Java heap

  • For tasks running a Java app inside: if the following parameters are not explicitly defined by a user, each Java app in a task uses 90% of the corresponding task's memory, so that it does not exceed the physical memory of a cloud instance. For example, if a user did not define chip.filter_picard_java_heap, the pipeline will use 90% of chip.filter_mem_mb for the Java heap (-Xmx) of Picard tools in the task filter. See the example after this list.
    • chip.align_trimmomatic_java_heap
    • chip.filter_picard_java_heap
    • chip.gc_bias_picard_java_heap
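
These automatic heap sizes can still be overridden explicitly in the input JSON; a sketch with illustrative values:

    {
        "chip.align_trimmomatic_java_heap": "4G",
        "chip.filter_picard_java_heap": "8G"
    }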

Bug fixes

  • Subsampling TAG-ALIGN (for PE dataset only)
    • PE subsampling task actually subsampled 2 x chip.subsample_reads reads.
  • The pipeline's default settings are not affected by this bug.
    • Affected cases:
      • chip.subsample_reads > 0 (0 by default) and chip.paired_end == true and the actual number of reads in a replicate is > chip.subsample_reads.
      • chip.ctl_subsample_reads > 0 (0 by default) and chip.ctl_paired_end == true and the actual number of reads in a control is > chip.ctl_subsample_reads.
      • Users starting from input types other than FASTQ (e.g. BAM, NODUP-BAM, TA) with chip.paired_end == true and the actual number of reads in a replicate > chip.xcor_subsample_reads (15M by default).
  • Fix grep error on OSX.
  • Swapped lines in chip.croo.v4.json.
  • Cannot start from BAMs on DNAnexus (using Web UI).
  • JSD didn't work without a blacklist.
  • Pooled TAG-ALIGN had a fixed prefix "basename_prefix".
  • Croo task graph got complicated due to a diamond-dependency problem of the task choose_ctl.

v1.3.6

28 Jan 01:09
209a71d

Conda users should re-install pipeline's environment.

$ bash scripts/uninstall_conda_env.sh
$ bash scripts/install_conda_env.sh

DNAnexus web-interface users should use workflows suffixed with -dockerhub. i.e. v1.3.6-dockerhub.

  • New parameters (in an input JSON)
    • chip.crop_length: Crop FASTQs with Trimmomatic. Cropping is disabled by default (set to 0). Check your FASTQs' read length first: any reads SHORTER than this length will be excluded while cropping, hence not included in output BAMs and all downstream analyses.
    • chip.fdr_thresh: FDR threshold for the SPP peak caller. It is 0.01 by default. Use a more relaxed value (see the example after this list) if you see the following File is empty error in the SPP task call-peak. Possible fix for issue #119.
	Traceback (most recent call last):
	  File "/root/miniconda3/envs/encode-chip-seq-pipeline/bin/encode_task_spp.py", line 103, in <module>
	    main()
	  File "/root/miniconda3/envs/encode-chip-seq-pipeline/bin/encode_task_spp.py", line 94, in main
	    assert_file_not_empty(rpeak)
	  File "/root/miniconda3/envs/encode-chip-seq-pipeline/bin/encode_lib_common.py", line 212, in assert_file_not_empty
	    raise Exception('File is empty ({}). Help: {}'.format(f, help))
	Exception: File is empty (rep2-R1.subsampled.50.merged.nodup.pr2_x_ctl_for_rep2.300K.regionPeak.gz). Help:
  • Changes in parameters

    • chip.xcor_pe_trim_bp -> chip.xcor_trim_bp: _pe_ was misleading since it is also applied to SE FASTQs.
  • Bug fixes

    • Ungzipped single FASTQ input
    • DNAnexus: failure at read_genome_tsv due to errors while retrieving the image from quay.io; dockerhub is used instead.
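
A more relaxed chip.fdr_thresh (described above) can be set in the input JSON; a minimal sketch with an illustrative value:

    {
        "chip.fdr_thresh": 0.05
    }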

v1.3.5.1

13 Jan 16:51
a7fd35c

Conda users need to re-install pipeline's Conda env.

$ bash scripts/uninstall_conda_env.sh
$ bash scripts/install_conda_env.sh

Output file name change

  • Pooled TAG-ALIGN file will have a fixed prefix of rep.pooled instead of using rep1's prefix.

Change in default parameters

  • chip.filter_picard_java_heap: 4G to dynamic (chip.filter_mem_mb)
  • chip.gc_bias_picard_java_heap: 6G to 10G

Troubleshooting for failed pipelines

  • Added help text for overly stringent IDR thresholds.

Added important GNU apps to Conda env

  • tar: to sort files by name with tar --sort
  • grep: to use Perl-style regular expressions with grep -P

Misc.

  • Downgraded Java version 11 -> 8 in docker/singularity images.
  • Output def JSON file for Croo: v3 released.
  • bowtie2 log is printed directly to STDOUT instead of to .align.log.
  • Removed a wrong arrow (FASTQ R2 -> chip.xcor) from the task graph in Croo's HTML report.

v1.3.4

16 Nov 02:04
61f77dd

IMPORTANT: Update Caper and Croo for the task graph in a Croo HTML report. Old Croo will not work with the new pipeline's metadata.json.

$ pip install --upgrade caper croo 

IMPORTANT: Conda users must update their environment.

$ bash scripts/update_conda_env.sh

Task graph on Croo report.

  • Updated the output definition JSON file on the pipeline's side.

Default parameter changes

  • chip.macs2_signal_track_disks: 200 GB -> 400 GB
    • to prevent a possible PAPI error code 10.

WDL

  • Preemption is now allowed on GCP for the task macs2_signal_track.

Bug fixes

  • updated documentation for Conda installation for OSX users.
    • OSX users need to install GNU grep.

v1.3.3

01 Nov 00:13
c238150

IMPORTANT: Conda users must re-install Conda env.

$ bash scripts/uninstall_conda_env.sh
$ bash scripts/install_conda_env.sh

New parameters to control JAVA max heap (java -Xmx)

  • Should be helpful for issue #88.
  • Added the following two parameters (see the example after this list):
    • chip.filter_picard_java_heap: 4G by default (for Picard MarkDuplicates)
    • chip.gc_bias_picard_java_heap: 6G by default (for Picard CollectGcBiasMetrics)
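
For example, a larger heap for Picard MarkDuplicates can be requested in the input JSON (the value below is illustrative):

    {
        "chip.filter_picard_java_heap": "8G"
    }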

Bug fixes

  • Fixed merging of two blacklists with different numbers of columns (3 and 6).

Croo

  • presigned URLs for organized outputs
    • They are PUBLIC. Use this at your own risk.
  • added UCSC browser tracks.
    • bigWig: MACS2 signal tracks (p-val and fold-enrich).
    • bigBed: optimal/conservative idr/overlap peaks.

Change of default parameters

  • chip.align_disks: 200 GB -> 400 GB

Removed old method

  • Completely removed the old method.
  • Users must use Caper to run pipelines.

v1.3.2

23 Oct 21:27
94e0237

IMPORTANT: Conda users must update the pipeline's Conda environment (no re-installation needed). This just updates the pipeline's Python task wrappers.

$ bash scripts/update_conda_env.sh

New feature

  • Removed the parameter chip.keep_irregular_chr from the pipeline.
  • Added a genome-specific parameter regex_bfilt_peak_chr_name instead (chr[\dXY]+ by default, which matches chr1, chr2, ..., chrX and chrY). You can define it either in a genome TSV (e.g. regex_bfilt_peak_chr_name[TAB]chr[\dXY]+) or in your input JSON via "chip.regex_bfilt_peak_chr_name"; see the example after this list.
    • This parameter defines which chromosomes to keep in the final peak files (with the .bfilt. suffix). This filter is applied even without a blacklist.
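
For example, to also keep the mitochondrial chromosome in .bfilt. peak files (a hypothetical use case; note that the backslash must be escaped in JSON):

    {
        "chip.regex_bfilt_peak_chr_name": "chr[\\dXYM]+"
    }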

Bug fixes

  • Pipeline now correctly catches non-zero exit codes from failed tasks (align and call_peak).
  • Pipeline can run without a blacklist.

Dependencies

  • Added wget and curl to Conda environment.

Genome data

  • Genome database builder now generates tarballs with identical md5sums across builds.
  • Gzipped Bowtie2/BWA indices (.tar.gz) with arbitrary filenames can now be used.
    • Files in a tarball no longer need to be prefixed with the tarball's filename prefix.

v1.3.1

12 Oct 04:31
c8ac1b2

IMPORTANT: Conda users must re-install Conda environment.

$ bash scripts/uninstall_conda_env.sh
$ bash scripts/install_conda_env.sh

Added latest python3 MACS2 2.2.4 to Conda env

  • removed the python2 MACS2 from the py2 Conda env
  • this upgrade slightly changes MACS2's output, so the next version will be 1.4.0.

Added missing deps to Conda env

  • ghostscript: to fix a gs error on the Stanford Sherlock cluster
  • caper and croo: for users' convenience when the pipeline's Conda env (py3) is activated. The PYTHONNOUSERSITE env var is set in the pipeline's Conda env, so users' locally pip-installed caper and croo are ignored while it is activated.

Fix for issue #91

  • removed all file-linking (soft/hard) from the pipeline

v1.3.0

03 Oct 09:18
c09a1a7
  • update for Conda users

    IMPORTANT: Conda users must uninstall old pipeline's Conda environment (scripts/uninstall_conda_env.sh) and re-install it (scripts/install_conda_env.sh). Pipeline now supports old (<4.7) and new (>=4.7) Conda versions.

    • pipeline now supports Conda >= 4.7. please follow (carefully) the Conda installation instructions in the README.
    • pipeline's base Conda environment is now based on python3 (instead of python2), so users must re-install the pipeline's Conda environment. please follow the instructions above.
  • update for Google Cloud Platform (GCP) users

    • we will keep the old naming google for GCP but it is recommended to use gcp instead. For example, use hg38_gcp.tsv instead of hg38_google.tsv for the genome TSV file.
  • moved files for old method to dev/

    • this will be deprecated soon. please use Caper; the old method has known unfixed issues.
  • blacklist filtering in JSD (Jensen-Shannon Distance) calculation

    • deeptools' native blacklist filtering turns out to be very slow for a blacklist BED with >= 1000 lines. See this for details.
    • we make a temporary blacklist-filtered BAM using bedtools intersect and use it for deeptools plotFingerprint.
  • upgraded genomics software in Conda env/docker container:

    • Conda environment is now based on python3, with an additional python2 environment for packages that are still py2-only (MACS2, metaseq)
    • updated software versions:
    • python 2.7 -> 3.6.6
    • samtools 1.2 -> 1.9 (both backward/forward compatibility for command lines)
    • phantompeakqualtools 1.2 -> 1.2.1 (to remove negative peaks)
    • deeptools 2.5.4 -> 3.3.1 (to print out synthetic JSD for samples without controls)
    • picard 2.10.6 -> 2.20.7
    • r 3.3.2 -> 3.2.2 (had to downgrade to support Conda >= 4.7 because free channels with a higher R version and python 3.6.6 are not allowed). we keep the R version at 3.4.4 (with r-spp 1.15) in the docker container though; it was not possible to match R versions between the Conda env and the docker container.
    • bowtie2 2.2.6 -> 2.3.4.3
    • bwa 0.7.13 -> 0.7.17
    • bedtools 2.26.0 -> 2.29.0
  • change of parameters

    • chip.regex_filter_reads (String) is replaced with chip.filter_chrs (Array[String])
    • e.g. to remove the mito chromosome MT, the input JSON should have "chip.filter_chrs": ["MT"]
    • chip.filter_chrs is [] by default, i.e. for ChIP-seq we keep mito chromosomes in a filtered BAM by default.
    • resource parameter name change
      • chip.call_peak_* are shared by both peak callers MACS2 and SPP
      • chip.align_* are shared by both aligners BOWTIE2 and BWA
      • chip.bowtie2_mem_mb and chip.bwa_mem_mb -> chip.align_mem_mb
      • chip.bowtie2_cpu and chip.bwa_cpu -> chip.align_cpu
      • chip.bowtie2_disks and chip.bwa_disks -> chip.align_disks
      • chip.macs2_mem_mb and chip.spp_mem_mb -> chip.call_peak_mem_mb
      • chip.macs2_disks and chip.spp_disks -> chip.call_peak_disks
      • chip.spp_cpu -> chip.call_peak_cpu
  • change in default parameters

    • mapq_thresh: default 30 for both bwa and bowtie2 aligners. you can still define any mapq_thresh.
  • added bowtie2 as a new DEFAULT aligner

    • bowtie2 is a new DEFAULT aligner
    • users can still use bwa instead of bowtie2 by setting a parameter in the input JSON file: "chip.aligner": "bwa"
  • use SAMstats instead of samtools flagstat

    • better read counting for raw/filtered BAMs
  • can use custom aligner/peak_caller

    • specify your custom aligner code with custom_align_py and an index TAR file with custom_aligner_idx_tar
    • specify your custom peak caller custom_call_peak_py
    • for custom genomes, it's recommended to use genome data builder to build custom_aligner_idx_tar and custom_aligner_mito_idx_tar.
  • added QC for GC bias

    • GC-bias plot is added to the align section of the HTML report. you can disable it by setting the flag chip.enable_gc_bias to false in your input JSON.
  • multiple blacklists

    • added blacklist2. users can define chip.blacklist2 in an input JSON file or add a new row (blacklist2[TAB][YOUR_2ND_BLACKLIST]) to a genome TSV file.
    • multiple blacklists will be merged with zcat command
  • better readability in QC report/JSON

    • organized outputs into big categories: align, lib_complexity, replication, align_enrich, peak_enrich, etc.
    • in all docs, HTML reports and QC JSON files, replaced confusing abbreviations with more descriptive ones:
      • pprY -> pooled-prY
      • ppr -> pooled-pr1_vs_pooled_pr2
      • repX-pr -> repX-pr1_vs_repX-pr2
  • fixed bug

    • removed python multiprocessing from all wrappers: to minimize memory error for SLURM + Singularity
    • replaced all sambamba with samtools in command lines since sambamba has some seg-fault issues.
    • do not make a read-length log for R2 (for PE); the GCP backend sometimes picks the wrong read-length file
    • matplotlib X server error
    • numpy conflict in metaseq
    • SAMstats multiprocess error (downgrading SAMstats from 0.2.2 to 0.2.1)
    • Index TAR file unpacking issue in docker/singularity container (ownership problem)
    • MACS2 10th column == -1 issue
    • py2->3 formatting issue in a subsampled filename (15.0M -> 15M)
    • upgraded MACS2 to 2.1.3.3 to remove spurious spikes in peaks.
    • hard-linking problems in the reproducibility step.

v1.2.2

15 Jun 04:31
844bec3

WARNING: Conda users must update their Conda environments.

$ bash conda/update_conda_env.sh
  • mixed endedness per replicate. For example, for three replicates with mixed endedness (rep1 and rep3: SE, rep2: PE). Similarly for controls (chip.ctl_paired_ends).

     {
     	"chip.paired_ends": [false, true, false]
     }
  • mixed data type per replicate. For example, for three replicates with mixed data types (rep1: BAM, rep2: NODUP_BAM, rep3: TAG-ALIGN). Similarly for controls (chip.ctl_bams, ...).

     {
     	"chip.bams": ["rep1.bam", null, null]
     	"chip.nodup_bams": [null, "rep2.nodup.bam", null]
     	"chip.tas": [null, null, "rep3.tagAlign.gz"]
     }
  • no auto-installation for croo/caper inside py3 conda env

  • removed resumer support

    • instead, Caper (with Cromwell's native call-caching) is recommended
  • bug fixes

    • qc_report fails due to type coercion (File -> File?) of outputs from idr/overlap
    • fix for unreplicated pipeline failing on DNAnexus