The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Fixed some issues where indices were not created
- Updated the docs
- Revert VEP version to v105
- Added `watchpath` functionality to the pipeline. Add the `watch:` prefix to a file basename in the samplesheet and the pipeline will automatically wait for the file to be created in the `--watchdir` directory (the lookup happens recursively)
- Bumped the minimum supported Nextflow version to `24.04.0`
- Bumped all modules to the newest versions
- The pipeline now also outputs `csi` indices
- Renamed the `master` branch to `main`
- Low coverage regions (regions with less than 5 reads) are no longer considered for variant calling
- Updated the pipeline to the new linting guidelines
- Removed `check_max` in favor of `resourceLimits`
- Automap analysis should now give the correct output files for individuals.
- Haplotypecaller will not perform phasing by default now. This can still be turned back on using the `--hc_phasing` parameter.
- Removed the WES and WGS profiles.
- Added UPDio for Uniparental Disomy detection in family samples. This introduces the `--updio` parameter to turn on this detection and `--updio_common_cnvs` to supply a common CNVs file to UPDio. The family needs to contain at least one child with its mother and father.
- Added docs built with MkDocs. See the documentation site here.
- Added AutoMap to find regions of homozygosity from human samples. This introduces the `--automap` parameter to turn on this feature and the `--automap_repeats`, `--automap_panel` and `--automap_panel_name` parameters to configure AutoMap (see the parameters docs for more information)
- Updated all tests to use snapshots instead.
- Made the pipeline pluggable to enable the use of it in a meta pipeline.
- Fixed an issue with `igenomes` paths not being cast correctly to their corresponding parameter
- Updated to the nf-core template v2.13.0
- Updated all GATK modules to 4.5.0.0
- Moved the pipeline from https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline to https://github.com/nf-cmgg/germline
- VCF files created with `haplotypecaller` now have the `haplotypecaller` tag in the filename instead of `gatk4-joint` to keep naming consistent
- Set the default ensembl VEP version to 105.0 instead of using dynamic container fetching
- Added the `--output_suffix` parameter to add a custom suffix to the basename of the output files.
- Implemented files for the alphamissense plugin of VEP.
- Added the `--only_pass` parameter to only output variants that have the `PASS` flag in the FILTER column. (This is only applied when `--filter` is also given)
- Added the `--keep_alt_contigs` parameter. This will tell the pipeline to not filter out the alternate contigs, which will now be done by default.
- Added dbsnp IDs to VCFs coming from vardict. This will be done automatically if a dbsnp VCF is given to the pipeline through the `--dbsnp` parameter.
- Updated the seqplorer profile so that the output filenames are correct for easy import
- Changed the separator in `--vcfanno_resources` to `;` instead of `,` to allow commas in glob patterns.
- Removed the reheader step from the vardict subworkflow and added a simple sed substitution to the vardictjava module
- `vcf2db` now uses a python 2 environment to increase its stability
- Added the `--callers` parameter to specify the variant caller to use. Currently only `haplotypecaller` and `vardict` are supported.
- Added the `vardict` variant caller.
- Added the `--vardict_min_af` parameter to specify the minimum allele frequency for `vardict`. This option is also available in the samplesheet as `vardict_min_af` to set it dynamically per sample.
- Added the `--output_genomicsdb` option to specify whether a GenomicsDB should be outputted or not. This will be `true` when using `only_merge`.
- Added `--normalize` options for decomposing and normalizing of variants after calling and genotyping.
- Added `WGS`, `WES`, `SeqCap`, `HyperCap` and `seqplorer` profiles that can be used to set the default parameters for these types of runs.
- Refactored the pipeline to accommodate future additions of variant callers and genotypers
- Removed a lot of unnecessary bloat
- Improved GenomicsDBImport (can now be multithreaded and runs a lot faster). This makes very big runs more feasible.
- Changed `coverage_fast` to `mosdepth_slow`, reversing the effect of the parameter. By default mosdepth will now be run with `--fast-mode`. This can be disabled using the new `mosdepth_slow` parameter.
- Automatically merge the regions that are within 150 bps of each other for the variant calling. This ensures that indel calling happens correctly.
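The region merging mentioned above can be pictured with the following Python sketch. It is only an illustration under stated assumptions: regions are BED-style half-open intervals, a gap of exactly 150 bp still counts as "within 150 bps", and the function name `merge_nearby` is hypothetical. The pipeline's own implementation may differ.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), BED-style half-open


def merge_nearby(regions: List[Region], distance: int = 150) -> List[Region]:
    """Merge regions on the same chromosome whose gap is at most `distance` bp."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= distance:
            # Close enough to the previous region: extend it instead of adding a new one.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

For example, `merge_nearby([("chr1", 100, 200), ("chr1", 300, 400)])` yields a single region `("chr1", 100, 400)`, so an indel that falls in the 100 bp gap is no longer cut off by a region boundary.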
- Fixed an issue with the outputting of the validation PNG files, now all three types of PNGs are outputted.
- Fixed a small issue where VCFs without a sample created by the callers could not be used by `bcftools concat`, these files will now be filtered from the input of the command.
- Removed the `--maxentscan` parameter because this file is automatically present in the container
- Added the `--only_call` parameter. Specifying this parameter tells the pipeline to only do variant calling and skip all post-processing. This will only output the GVCFs and files created to help variant calling.
- The samplesheet is now also in the output folder.
- Added an option `--only_merge` to tell the pipeline to create genomicsdbs and stop running there
- Get regions from the GVCF instead of CRAM for joint genotyping. This removes the need to supply a CRAM file when a GVCF file has been used as input.
- Updated `nf-validation` to v0.2.1.
- Updated the samtools/merge tool to the nf-core version. This improves the efficiency and disk space usage of the tool.
- Fixed an error where the truth VCFs caused a join error when the same sample was given multiple times
- Updated some outdated error messages
Changed the output directory structure to be more bcbio like
- Added support for the `nf-validation` plugin.
- Haplotypecaller dragen mode will be automatically disabled when not using a dragstr model.
- Removed bedtools/jaccard
- Fixed some patterns in the parameter JSON schema (since they are actually used now)
- Fixed a breaking bug where mosdepth didn't output the callable regions (this makes v1.2.0 deprecated, please use v1.2.1 instead)
- Genomicsdbs aren't scattered now, this increases the precision of the analysis by almost 3% at the cost of a bit longer runtimes
- Actually do the validation on the output VCFs now instead of the freshly called GVCFs
- Improved the efficiency of the VEP run by scattering more efficiently on the amount of variants instead of the chromosomes
- Added a `--coverage_fast <true/false>` flag which can be used to run mosdepth in fast mode. This flag will also make sure that only the quantized bed from mosdepth is present in the output directory for each WGS individual, otherwise it will output everything
- Added the possibility to give GVCF files as inputs and immediately go to the joint-genotyping. This is especially useful for the cases where several samples should be combined. This way the variant calling doesn't need to be re-run. Beware though that a CRAM file should still be given to generate the BED files used for the scatter/gathering. The new header names are `gvcf` and `tbi` where `gvcf` is used to give the GVCF and `tbi` is used to give its index.
- Added `bedtools jaccard` to the validation.
- Added a Dockerfile which creates an image that is able to run a full pipeline run inside of it.
- Added better documentation
- Updated the scattering again. It now follows this workflow:
  - Sort and merge overlapping intervals of given ROI BED files (WES only)
  - Create a BED file with callable regions using mosdepth
  - Intersect the callable regions BED with the ROI BED (WES only)
  - Split the resulting BED file (or the callable regions BED for WGS) into evenly sized BED files (amount is specified with `--scatter_count`)
  - Run HaplotypeCaller in parallel using these regions
  - Merge and sort the BED files of all individuals in a family
  - Split the merged BED file into evenly sized BED files (amount is specified with `--scatter_count` times the family size)
  - Run GenomicsDBImport and GenotypeGVCFs in parallel using these regions
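To make the splitting step in the workflow above more concrete, here is a minimal Python sketch that distributes regions over a fixed number of chunks with roughly equal total length. It assumes that "evenly sized" refers to the number of bases per chunk and that regions are not cut in two; both are assumptions, and `scatter_regions` is a hypothetical name rather than anything from the pipeline.

```python
import heapq
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), BED-style half-open


def scatter_regions(regions: List[Region], scatter_count: int) -> List[List[Region]]:
    """Greedily spread regions over `scatter_count` chunks of similar total size."""
    if scatter_count < 1:
        raise ValueError("scatter_count must be at least 1")
    chunks: List[List[Region]] = [[] for _ in range(scatter_count)]
    # Min-heap of (total bases in chunk, chunk index): the smallest chunk is on top.
    heap = [(0, index) for index in range(scatter_count)]
    # Place the largest regions first, always into the currently smallest chunk.
    for region in sorted(regions, key=lambda r: r[2] - r[1], reverse=True):
        total, index = heapq.heappop(heap)
        chunks[index].append(region)
        heapq.heappush(heap, (total + region[2] - region[1], index))
    # Keep every chunk sorted by chromosome and position, like a sorted BED file.
    return [sorted(chunk) for chunk in chunks]
```

For the family-level step, the same function could be called with `scatter_count` multiplied by the family size, mirroring the description in the list.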
- Updated the resource requirements of GenomicsDBImport and GenotypeGVCFs to be more efficient (and more cluster friendly)
- Removed ReblockGVCFs (this wasn't worth it and we save the raw GVCFs)
- Added `--merge_distance <integer>` to decrease the amount of intervals passed to genomicsdbimport. Increase this value if GenomicsDBImport is running slow.
- Renamed `--use_dragstr_model` to `--dragstr`.
- Fixed a warning showing up when running with `--dragstr false`
- Add `--infer` flag to `somalier relate` when no PED file is given
- Added a parameter for setting the splitting depth threshold: `--split_threshold FLOAT`
- Change the default splitting threshold to 0.2 instead of 0.3
- Set the default of `--validate` to `false`
- Fixed a bug with ensembl VEP. Filenames of the alt contigs should now have a `_alt` suffix instead of all alt contigs.
- Added file-exist check to the `sdf` file
- Fixed the scattering when using alt contigs
- BED file input is now optional (the regions are created from the FASTA index). Providing a BED file is still preferred for the most optimal runs.
- Added support for samples that aren't part of a family. Just leave the `ped` and `family_id` input fields in the samplesheet empty for a sample to be treated like this. This sample will go through exactly the same workflow but will be emitted as a single-sample VCF.
- Added `dump` functionality to lots of channels.
- Added the `dbsnp` option to `GATK HaplotypeCaller`. Use `--dbsnp` and `--dbsnp_tbi` to supply these VCFs.
- Added the `vcf_extract_somalier` subworkflow to the pipeline. This also creates PED files inferred from the input multi-sample VCF.
- Added a validation subworkflow. All files that have a VCF in the `truth_vcf` column of the input samplesheet will be validated against this VCF. This can be turned off by supplying the `--validate false` flag to the pipeline run.
- Improved the scatter/gather logic. This is now done with `goleft indexsplit` to define chunks of even coverage. The genotyping scattering now happens with `bedtools makewindows`. This creates chunks of even regions from the merged BED files for the family. By passing a padding of about 20 bps to the genotype tools, we make sure all variants on the edges of these regions are also genotyped. Duplicates are removed later when running `bcftools concat`
- Refactored a lot of the code to maintain the same style over the whole pipeline.
- Updated the minimum Nextflow version to `22.10.5` to make sure S3 staging works perfectly.
- The `post_processing` subworkflow has been renamed to the better-suited `joint_genotyping` subworkflow. `reblockgvcf` has been moved to `germline_variant_calling` and the `filter` and `reheadering` have been moved to the main workflow.
- Merging VCFs of the same family now happens with `GATK GenomicsDBImport` instead of `GATK MergeGVCFs` or `bcftools merge`. This gives more reliable results.
- Improved the handling of `vcfanno`
- The PED headers can now be added to all the output VCFs that are part of a family instead of only those that were given a PED file as input. The PED file used is created using `somalier relate`. This feature can be turned on using the `--add_ped true` argument. This doesn't happen by default.
- Fixed some issues when both the `ped` and `family_id` were given for a sample.
- Fixed the PED input for `rtgtools_pedfilter` (`-9` isn't recognized as unknown by the tool. Now these will be automatically converted to `0` before this tool)
- Fixed issues with the DBsnp index not being created correctly
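The `-9` to `0` conversion for `rtgtools_pedfilter` mentioned above can be pictured with the short Python sketch below. It assumes a tab-separated PED line and only touches the columns after the family and individual IDs; which columns the pipeline actually rewrites is not stated in this changelog, so treat that choice and the name `normalize_ped_line` as hypothetical.

```python
def normalize_ped_line(line: str) -> str:
    """Replace `-9` (unknown) with `0` in all PED fields after the first two.

    Hypothetical helper: blank lines and `#` header lines are returned unchanged.
    """
    if not line.strip() or line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    # Leave the family ID and individual ID alone; only value columns are rewritten.
    fixed = fields[:2] + ["0" if field == "-9" else field for field in fields[2:]]
    return "\t".join(fixed) + newline
```

Applied line by line to a PED file, this leaves the IDs intact while making unknown values readable for a tool that expects `0`.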
- Fixed wrongly formed joins and added checks for mismatches and duplicates
- Upgraded to `nf-core` v2.6 template
- Fixed the `ensemblvep` version (was 104.1 before and is now 105.0)
- Updated the label of `gatk4/calibratedragstrmodel` to `process_high` to match the requirements for bigger inputs
- Full release of the pipeline
Initial release of nf-cmgg/germline, created with the nf-core template.