This workflow allows you to extract variants and samples that comply with both a set of genotype filters and a set of functional annotation filters, by intersecting the genotype VCFs with the functional annotation VCFs.
The pipeline has the following main processes:
- FIND_CHUNK: finds the genomic and functional annotation aggregate chunks of interest.
- EXTRACT_VARIANT_VEP: filters the functional annotation aggregate VCFs.
- INTERSECT_ANNOTATION_GENOTYPE_VCF: intersects the genomic VCF with the filtered annotation VCF.
- FIND_SAMPLES: finds samples of interest.
- SUMMARISE_OUTPUT: produces summary tables.
This is a region file of your genes of interest. It must be a three- or four-column tab-delimited file of chromosome, start, and stop, with an optional fourth column holding an identifier (e.g. a gene name). The file should have the .bed extension.
Example of an input_bed file:
chr2 213005363 213151603 IKZF2
chr7 50304716 50405101 IKZF1
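The format rules above can be checked before launching the workflow. The following is a minimal sketch, not part of the pipeline: the function name `parse_bed_line` is hypothetical, and rejecting lines where start is greater than stop is an assumption rather than a documented requirement.

```python
def parse_bed_line(line):
    """Parse one tab-delimited BED line into (chrom, start, stop, name or None)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 tab-separated columns, got {len(fields)}")
    chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
    # Assumption: a region whose start lies after its stop is treated as an error.
    if start > stop:
        raise ValueError(f"start is greater than stop: {start} > {stop}")
    name = fields[3] if len(fields) == 4 else None
    return chrom, start, stop, name


# The two example regions from above, written with real tab separators.
example = "chr2\t213005363\t213151603\tIKZF2\nchr7\t50304716\t50405101\tIKZF1\n"
regions = [parse_bed_line(line) for line in example.splitlines()]
```

A line separated by spaces instead of tabs is read as a single column and raises a `ValueError`, which is a common cause of a malformed region file.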
This is the list of chunk names and full file paths to both the genotype and functional annotation VCFs for either aggV2 or aggCOVID. These can be found under GEL data resources > aggregate_file_lists > aggV2_chunk_names.bed and GEL data resources > aggregate_file_lists > aggCOVID_4.2_chunk_names.bed.
This parameter defines whether to include (set to -i) or to exclude (set to -e) the sites selected using the --expression parameter (see below).
This parameter defines the bcftools filter of your query. See the bcftools EXPRESSIONS documentation for accepted filters: https://samtools.github.io/bcftools/bcftools.html#expressions.
This parameter defines the format of the query; see https://samtools.github.io/bcftools/bcftools.html#query for details. For the process to run, the format must include the following fields: '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\n]'. You can also specify additional fields after this initial list.
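To illustrate how additional fields follow the required ones, here is a small sketch that assembles a format string. The helper `build_query_format` is hypothetical and not part of the workflow; `%GT` is used only as an example of an extra per-sample field accepted by bcftools query.

```python
# The five fields the workflow needs, in the order given above.
REQUIRED_FIELDS = ["%SAMPLE", "%CHROM", "%POS", "%REF", "%ALT"]


def build_query_format(extra_fields=()):
    """Return a bcftools query format string with any extra fields appended."""
    fields = REQUIRED_FIELDS + list(extra_fields)
    # "\\t" and "\\n" keep the backslashes literal; bcftools interprets them itself.
    return "[" + "\\t".join(fields) + "\\n]"


default_format = build_query_format()
with_genotype = build_query_format(["%GT"])
```

Keeping the required fields first means the first five columns of the output stay in the same order regardless of what is appended.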
Number of CPUs to be used by each Nextflow process. The default is 1 CPU per process, but when using an input_bed file with more than 5 entries please set it to a higher value.
Total RAM available for each Nextflow process. The default is 2.GB per process, but when using an input_bed file with more than 5 entries please set it to a higher value.
This file lists the severity of variants. It can be found under GEL data resources > aggregations > gel_mainProgramme > somAgg > v0.2 > additional data > vep severity scale > VEP_severity_scale_2020.txt. Provide this file only if you are interested in variants with a specific consequence.
This parameter sets the severity of the variants you are interested in for your query. For example, if you want to look only at missense variants or worse, the input value would be missense. Only use this parameter if severity_scale is set.
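The idea of "this consequence or worse" can be sketched as a cutoff on an ordered list. The example below is illustrative only: it assumes the scale is ordered from most to least severe, and the five labels in `HYPOTHETICAL_SCALE` are made up for the example and are not the contents of VEP_severity_scale_2020.txt.

```python
# Hypothetical excerpt of a severity scale, most severe first (an assumption).
HYPOTHETICAL_SCALE = ["stop_gained", "frameshift", "missense", "synonymous", "intron"]


def consequences_at_least(scale, cutoff):
    """Return every consequence in `scale` that is as severe as `cutoff` or worse."""
    if cutoff not in scale:
        raise ValueError(f"unknown consequence: {cutoff}")
    # Everything up to and including the cutoff is at least as severe.
    return scale[: scale.index(cutoff) + 1]


selected = consequences_at_least(HYPOTHETICAL_SCALE, "missense")
```

With this ordering, choosing missense keeps the missense entry and everything listed above it, and drops the milder consequences below it.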
This workflow produces three outputs for each gene in your input BED file:
- *_result.tsv file: this is the tab-delimited output from the bcftools query command.
- *_platekey_summary.tsv file: this is a two-column tab-delimited file, where one column is the list of platekeys recovered by the query, and the second column is the number of variants per participant that satisfied the query.
- *_variant_summary.tsv file: this is a two-column tab-delimited file, where one column is the list of variants that satisfied the query, and the second column is the number of participants that carry that variant.
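The relationship between the three outputs can be illustrated with a short sketch that derives both summaries from rows shaped like the result file. This is not the workflow's own implementation: the function `summarise`, the `chrom:pos:ref:alt` variant key, and the sample names and positions in `rows` are all made up for the example, and the rows are assumed to begin with the five required query fields.

```python
from collections import Counter


def summarise(result_lines):
    """Count variants per platekey and platekeys per variant from query rows."""
    variants_per_platekey = Counter()
    platekeys_per_variant = Counter()
    seen = set()  # guards against counting a repeated row twice
    for line in result_lines:
        if not line.strip():
            continue
        # Assumes the first five columns are SAMPLE, CHROM, POS, REF, ALT.
        sample, chrom, pos, ref, alt = line.rstrip("\n").split("\t")[:5]
        variant = f"{chrom}:{pos}:{ref}:{alt}"
        if (sample, variant) in seen:
            continue
        seen.add((sample, variant))
        variants_per_platekey[sample] += 1
        platekeys_per_variant[variant] += 1
    return variants_per_platekey, platekeys_per_variant


# Made-up rows for illustration only.
rows = [
    "SAMPLE_A\tchr7\t50304800\tA\tG",
    "SAMPLE_A\tchr7\t50305000\tC\tT",
    "SAMPLE_B\tchr7\t50304800\tA\tG",
]
per_platekey, per_variant = summarise(rows)
```

Here `per_platekey` corresponds to the platekey summary (variants per participant) and `per_variant` to the variant summary (participants per variant).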
An example question would be: "I want to extract the samples in aggV2 who are homozygous alt for missense (or worse) rare variants within the gene IKZF1".
The final command would look like this:
An example question would be: "I want to extract the samples in aggV2 who are homozygous alt for any type of variant within the gene IKZF1".
The final command would look like this: