Extend functionality for outlier sample exclusion workflow #496
This PR extends the existing outlier sample exclusion workflow, `wdl/FilterOutlierSamples.wdl`, in several respects:

- Accept one or more VCFs. This is necessary for large cohorts (e.g., gnomAD) where each chromosome is stored as a separate VCF for improved parallelization in the cloud, but we need to define outliers based on the sum of variants across all VCFs.
- Add an optional VCF preprocessing step (with bcftools) prior to collecting sample counts. This is necessary in situations where we want to restrict to certain subsets of variants for defining outliers. Two empirical use cases from gnomAD v3 include: (a) restricting to rare (`AF` < 1%), non-singleton (`AC` > 1) deletions between 300 bp and 1 kb to deal with the artifact deletion bump we sometimes see in various callsets, and (b) restricting to `PASS`-only variants for defining our final set of samples at the very end of all QC & post-processing.
- Allow outlier samples to be defined on one or more independent subsets of samples within the same VCF. This is necessary when cohorts contain a mixture of samples with different properties (e.g., PCR+ vs. PCR-) and we want to fit their SV count distributions separately when defining outliers. These are optional inputs to the workflow; if no values of `sample_subset_prefixes` and `sample_subset_lists` are provided, then outliers will be defined for all samples in the VCF together.
- Enable (optional) plotting of outlier distributions from within the workflow. I understand that there is a separate workflow for plotting outlier distributions (`PlotSVCountsPerSample.wdl`) and the intention is for users to optionally run that workflow after collecting per-sample counts, but from a convenience perspective I thought it would be useful to have plotting as an option of the main outlier workflow. If there is a design reason why this is undesirable, we can remove the plotting, but I know I have enjoyed having this convenience added for gnomAD.

I have tested the above changes on gnomAD v3 (24 VCFs & ~120k samples) and can confirm that they work as expected. I have not tested a cohort with a single VCF, but I believe it should work as long as the VCF is passed as an array of one element.
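
For reviewers, the multi-VCF summing and per-subset outlier logic described above can be illustrated with a rough Python sketch. This is not the WDL implementation: the function names (`sum_counts`, `iqr_outliers`, `outliers_by_subset`) are hypothetical, and the IQR-based cutoff with a multiplier of 3 is an assumption made only for illustration.

```python
from collections import Counter
from statistics import quantiles


def sum_counts(per_vcf_counts):
    """Sum per-sample variant counts across VCFs (e.g., one VCF per chromosome)."""
    total = Counter()
    for counts in per_vcf_counts:
        total.update(counts)  # adds counts sample by sample
    return dict(total)


def iqr_outliers(counts, n_iqr=3.0):
    """Return samples whose count is more than n_iqr * IQR below Q1 or above Q3.

    Assumed cutoff rule for illustration; needs at least two samples.
    """
    q1, _, q3 = quantiles(list(counts.values()), n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - n_iqr * iqr, q3 + n_iqr * iqr
    return {sample for sample, c in counts.items() if c < low or c > high}


def outliers_by_subset(counts, subsets=None, n_iqr=3.0):
    """Define outliers independently within each sample subset.

    `subsets` maps a prefix to a list of sample IDs. If it is empty or None,
    all samples are treated as one group, mirroring the default behavior
    when no subset inputs are provided.
    """
    if not subsets:
        subsets = {"all": list(counts)}
    return {
        prefix: iqr_outliers({s: counts[s] for s in samples if s in counts}, n_iqr)
        for prefix, samples in subsets.items()
    }


if __name__ == "__main__":
    # Toy per-sample counts from two VCFs (made-up numbers).
    chr1 = {"s1": 10, "s2": 12, "s3": 11, "s4": 10, "s5": 100}
    chr2 = {"s1": 5, "s2": 5, "s3": 5, "s4": 6, "s5": 50}
    totals = sum_counts([chr1, chr2])
    print(outliers_by_subset(totals))
```

The key point is that counts are summed across all VCFs before any cutoff is computed, and that each subset gets its own distribution, so a PCR+ sample is only ever compared against other PCR+ samples.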