Back to dev #13

Merged
merged 2 commits into from
Jan 8, 2025
8 changes: 8 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -10,3 +10,11 @@ null/
.nf-test
.nf-test*
.nf-test/*

.vscode
.vscode/*

tests/unmergedgvcfs
tests/unmergedgvcfs/*
tests/input-full-ncgm.csv
conf/test_full_ncgm.config
5 changes: 3 additions & 2 deletions .nf-core.yml
@@ -18,6 +18,7 @@ lint:
- docs/images/nf-core-vcftomat_logo_dark.png
- .github/ISSUE_TEMPLATE/bug_report.yml
included_configs: false
actions_ci: false
multiqc_config:
- report_comment
nextflow_config:
@@ -30,7 +31,7 @@ lint:
nf_core_version: 3.1.0
repository_type: pipeline
template:
author: "Famke B\xE4uerle, Dorothy Ellis"
author: "Famke Bäuerle, Dorothy Ellis"
description: Nextflow pipeline to convert (g)vcfs to matrices suitable for statistical
analysis
force: false
@@ -43,4 +44,4 @@ template:
- codespaces
- fastqc
- adaptivecard
version: 1.0.0dev
version: 1.1.0
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,18 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.1.0 - Newton Puccoon - 08.01.2025

### Added

- [#7](https://github.com/qbic-pipelines/vcftomat/pull/7) - samplenames to columns
- [#8](https://github.com/qbic-pipelines/vcftomat/pull/8) - concat for sample, label pairs

### Fixed

- [#5](https://github.com/qbic-pipelines/vcftomat/pull/5) - filename collision
- [#10](https://github.com/qbic-pipelines/vcftomat/pull/10) - prepare release 1.1.0

## v1.0.0 - Curie Purpureal - 16.12.2024

Initial release of qbic-pipelines/vcftomat, created with the [nf-core](https://nf-co.re/) template.
24 changes: 13 additions & 11 deletions README.md
@@ -1,7 +1,7 @@
# qbic-pipelines/vcftomat

[![GitHub Actions CI Status](https://github.com/qbic-pipelines/vcftomat/actions/workflows/ci.yml/badge.svg)](https://github.com/qbic-pipelines/vcftomat/actions/workflows/ci.yml)
[![GitHub Actions Linting Status](https://github.com/qbic-pipelines/vcftomat/actions/workflows/linting.yml/badge.svg)](https://github.com/qbic-pipelines/vcftomat/actions/workflows/linting.yml)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
[![GitHub Actions Linting Status](https://github.com/qbic-pipelines/vcftomat/actions/workflows/linting.yml/badge.svg)](https://github.com/qbic-pipelines/vcftomat/actions/workflows/linting.yml)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.14616650-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.14616650)
[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A524.04.2-23aa62.svg)](https://www.nextflow.io/)
@@ -16,9 +16,11 @@

1. Indexes (g.)vcf files ([`tabix`](http://www.htslib.org/doc/tabix.html))
2. Converts g.vcf files to vcf with `genotypegvcf` ([`GATK`](https://gatk.broadinstitute.org/hc/en-us))
3. Merges all vcfs from the same sample with `bcftools/merge` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html))
4. Converts the (merged) vcfs to a matrix using a custom R script written by @ellisdoro ([`R`](https://www.r-project.org/))
5. Collects all reports into a MultiQC report ([`MultiQC`](http://multiqc.info/))
3. Concatenates all vcfs that have the same id and the same label with `bcftools/concat` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html))
4. Changes the sample name in the vcf file to the filename with `bcftools/reheader` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html)) - This can be turned off by adding `--rename false` to the `nextflow run` command.
5. Merges all vcfs from the same sample with `bcftools/merge` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html))
6. Converts the (merged) vcfs to a matrix using a custom R script written by @ellisdoro ([`R`](https://www.r-project.org/))
7. Collects all reports into a MultiQC report ([`MultiQC`](http://multiqc.info/))

![](./docs/images/vcftomat.excalidraw.png)

@@ -32,13 +34,14 @@ First, prepare a samplesheet with your input data that looks as follows:
`samplesheet.csv`:

```csv
sample,gvcf,vcf_path,vcf_index_path
SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-2,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
sample,label,gvcf,vcf_path,vcf_index_path
SAMPLE-1,pipelineA-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-1,pipelineB-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
```

Each row represents a VCF file coming from a sample. The `gvcf` column indicates whether the file is a g.vcf file or not. The `vcf_path` and `vcf_index_path` columns contain the path to the VCF file and its index, respectively.
Each row represents a VCF file coming from a sample. The `label` column enables concatenation of vcfs (for example when the pipeline produces different vcfs for chrM and chrY). The `gvcf` column indicates whether the file is a g.vcf file or not. The `vcf_path` and `vcf_index_path` columns contain the path to the VCF file and its index, respectively.
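
To make the role of the `label` column concrete, here is a minimal Python sketch of how rows could be grouped: files sharing both `sample` and `label` form a concatenation group, and the distinct labels of one `sample` are what later get merged. The function name `plan_groups` and the grouping logic are illustrative assumptions, not the pipeline's actual Nextflow code.

```python
import csv
import io
from collections import defaultdict

# Example samplesheet mirroring the README; the paths are placeholders.
SAMPLESHEET = """\
sample,label,gvcf,vcf_path,vcf_index_path
SAMPLE-1,pipelineA-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-1,pipelineB-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
"""


def plan_groups(samplesheet_text):
    """Hypothetical helper: return (concat_groups, merge_groups).

    concat_groups maps (sample, label) -> list of VCF paths.
    merge_groups maps sample -> distinct labels in first-seen order.
    """
    concat_groups = defaultdict(list)
    merge_groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        concat_groups[(row["sample"], row["label"])].append(row["vcf_path"])
        if row["label"] not in merge_groups[row["sample"]]:
            merge_groups[row["sample"]].append(row["label"])
    return dict(concat_groups), dict(merge_groups)


concat_groups, merge_groups = plan_groups(SAMPLESHEET)
for (sample, label), paths in concat_groups.items():
    note = "concatenate" if len(paths) > 1 else "single file"
    print(f"{sample} / {label}: {len(paths)} file(s) -> {note}")
for sample, labels in merge_groups.items():
    note = "merge" if len(labels) > 1 else "one label only"
    print(f"{sample}: {labels} -> {note}")
```

With the example above, `SAMPLE-2` has two files under one label (a concatenation group), while `SAMPLE-1` has two labels with one file each (a merge across labels).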

Now, you can run the pipeline using:

@@ -65,8 +68,7 @@ If you would like to contribute to this pipeline, please see the [contributing g

## Citations

<!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. -->
<!-- If you use qbic-pipelines/vcftomat for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->
If you use qbic-pipelines/vcftomat for your analysis, please cite it using the following doi: [10.5281/zenodo.14616650](https://doi.org/10.5281/zenodo.14616650)

<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->

4 changes: 2 additions & 2 deletions assets/multiqc_config.yml
@@ -1,6 +1,6 @@
report_comment: >
This report has been generated by the <a href="https://github.com/qbic-pipelines/vcftomat/releases/tag/1.0.0" target="_blank">qbic-pipelines/vcftomat</a>
analysis pipeline.
This report has been generated by the <a href="https://github.com/qbic-pipelines/vcftomat/releases/tag/1.1.0"
target="_blank">qbic-pipelines/vcftomat</a> analysis pipeline.
report_section_order:
"qbic-pipelines-vcftomat-methods-description":
order: -1000
9 changes: 5 additions & 4 deletions assets/samplesheet.csv
@@ -1,4 +1,5 @@
sample,gvcf,vcf_path,vcf_index_path
SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-2,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
sample,label,gvcf,vcf_path,vcf_index_path
SAMPLE-1,pipelineA-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-1,pipelineB-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
8 changes: 7 additions & 1 deletion assets/schema_input.json
@@ -13,6 +13,12 @@
"errorMessage": "Sample name must be provided and cannot contain spaces",
"meta": ["id"]
},
"label": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Label must be provided and cannot contain spaces",
"meta": ["label"]
},
"gvcf": {
"type": "boolean",
"errorMessage": "",
@@ -40,6 +46,6 @@
"errorMessage": "Index of VCF file must have extension '.tbi'- Optional"
}
},
"required": ["sample", "gvcf", "vcf_path"]
"required": ["sample", "label", "gvcf", "vcf_path"]
}
}
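
As a rough illustration of what the updated schema enforces, the following Python sketch checks one samplesheet row against the required columns and the no-whitespace rule for `label`. The helper name `check_row` is hypothetical, and applying the same pattern to `sample` is an assumption based on its error message, since the `sample` pattern itself is not shown in this hunk. The pipeline itself validates through the JSON schema, not through code like this.

```python
import re

# Required columns as listed in the new "required" array of the schema.
REQUIRED = ["sample", "label", "gvcf", "vcf_path"]

# Error messages copied from the schema; the sample pattern is assumed
# to match the label pattern ("^\S+$").
MESSAGES = {
    "sample": "Sample name must be provided and cannot contain spaces",
    "label": "Label must be provided and cannot contain spaces",
}


def check_row(row):
    """Hypothetical helper: return a list of error strings for one row."""
    errors = []
    for column in REQUIRED:
        if row.get(column) in (None, ""):
            errors.append(f"missing required column: {column}")
    for column, message in MESSAGES.items():
        value = row.get(column)
        # fullmatch(r"\S+") is the strict equivalent of the "^\S+$" pattern.
        if value and not re.fullmatch(r"\S+", value):
            errors.append(message)
    return errors


good = {"sample": "SAMPLE-1", "label": "pipelineA-callerA",
        "gvcf": "false", "vcf_path": "path/to/vcf.gz"}
bad = dict(good, label="pipeline A")
print(check_row(good))
print(check_row(bad))
```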
33 changes: 29 additions & 4 deletions conf/modules.config
@@ -23,13 +23,38 @@ process {
}

withName: 'GATK4_GENOTYPEGVCFS' {
ext.prefix = { "${input.baseName.tokenize('.')[0]}" }
ext.prefix = { "${meta.name}" }
}

withName: 'BCFTOOLS_CONCAT' {
memory = 8.GB
ext.prefix = { "${meta.label}.concat" }
ext.args = { " --allow-overlaps --output-type z --write-index=tbi" }
publishDir = [
mode: params.publish_dir_mode,
path: { "${params.outdir}/bcftools/concat/" },
]
}

withName: 'BCFTOOLS_REHEADER' {
beforeScript = { "echo ${meta.label} > ${meta.label}.txt" }
ext.args = { "--samples ${meta.label}.txt" }
ext.prefix = { "${meta.label}.reheader" }
ext.args2 = { "--output-type z --write-index=tbi" }
publishDir = [
mode: params.publish_dir_mode,
path: { "${params.outdir}/bcftools/reheader/" },
]
}

withName: 'BCFTOOLS_MERGE' {
memory = 8.GB
ext.args = { '--force-samples' }
ext.prefix = { "${meta.id}.merged" }
memory = 8.GB
ext.args = { "--force-samples --output-type z --write-index=tbi" }
ext.prefix = { "${meta.id}.merge" }
publishDir = [
mode: params.publish_dir_mode,
path: { "${params.outdir}/bcftools/merge/" },
]
}

withName: 'MULTIQC' {
Binary file modified docs/images/vcftomat.excalidraw.png
20 changes: 18 additions & 2 deletions docs/output.md
@@ -6,27 +6,43 @@ This document describes the output produced by the pipeline. Most of the plots a

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

<!-- TODO nf-core: Write this documentation describing your workflow's output -->

## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [Tabix](#tabix) - Indexes (g.)vcf files
- [GenotypeGVCFs](#genotypegvcfs) - Converts g.vcf files to vcf with GATK
- [Concatenate VCFs](#concatenate-vcfs) - Concatenates all vcfs that have the same id and the same label with bcftools/concat
- [Rename Samples](#rename-samples) - Changes the sample name in the vcf file to the label with bcftools/reheader
- [Merge VCFs](#merge-vcfs) - Merges all vcfs from the same sample with bcftools/merge
- [Convert to matrix](#convert-to-matrix) - Converts the (merged) vcfs to a matrix using a custom R script written by @ellisdoro
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

### Tabix

Tabix generates index files with the `.tbi` extension for all `(g.)vcf` files that are passed to the pipeline without an index.

### GenotypeGVCFs

The GATK GenotypeGVCFs module converts genomic VCF (gVCF) files into regular VCF files. The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not.
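
The difference can be seen on individual records. The sketch below assumes GATK-style gVCF output, in which non-variant stretches are written as reference blocks whose only ALT allele is the symbolic `<NON_REF>`. The two data lines are invented for illustration and the helper `is_reference_block` is hypothetical.

```python
def is_reference_block(vcf_line):
    """Return True if a gVCF data line has '<NON_REF>' as its only ALT allele.

    Assumes a tab-separated VCF data line; column 5 (index 4) is ALT.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    return fields[4] == "<NON_REF>"


# Invented example records: a reference block and a variant site.
GVCF_LINES = [
    "chr1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=150\tGT:DP\t0/0:30",
    "chr1\t151\t.\tG\tT,<NON_REF>\t50\t.\t.\tGT:DP\t0/1:28",
]

for line in GVCF_LINES:
    kind = "reference block" if is_reference_block(line) else "variant site"
    print(line.split("\t")[1], kind)
```

A regular VCF would be expected to contain only records of the second kind.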

### Concatenate VCFs

Some variant calling pipelines return multiple (g)VCF files for one patient. `bcftools concat` is used to combine these VCFs into one VCF.

### Rename Samples

To enable comparison of the finalized CSV files, `bcftools reheader` can be used to rename the sample in the VCF from the generic name given by the variant caller to the custom label given in the samplesheet.
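
In effect, the rename step rewrites the sample column(s) of the VCF `#CHROM` header line. The following pure-Python sketch imitates that effect for illustration only; it is not how `bcftools reheader` is implemented, and the helper `rename_samples` and the original sample name `NA12878` are assumptions.

```python
# A VCF '#CHROM' line has 9 fixed columns before the sample columns:
# CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT.
N_FIXED_COLUMNS = 9


def rename_samples(header_line, new_names):
    """Hypothetical helper: replace the sample names in a '#CHROM' line.

    new_names must list one name per existing sample, in column order.
    """
    fields = header_line.rstrip("\n").split("\t")
    fixed, samples = fields[:N_FIXED_COLUMNS], fields[N_FIXED_COLUMNS:]
    if len(samples) != len(new_names):
        raise ValueError(
            f"expected {len(samples)} sample name(s), got {len(new_names)}"
        )
    return "\t".join(fixed + list(new_names))


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878"
renamed = rename_samples(header, ["pipelineA-callerA"])
print(renamed)
```

After renaming, each column of the merged VCF, and therefore of the final matrix, carries the label from the samplesheet instead of the caller's generic sample name.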

### Merge VCFs

To enable comparison of different variant callers or variant calling pipelines, all VCFs that come from the same sample are merged based on the sample ID submitted by the user.

### Convert to matrix

A custom R script is used to convert the finalized VCF to a CSV which can be used for further downstream analysis. The script was written by [Dorothy Ellis](https://github.com/ellisdoro).

### MultiQC

<details markdown="1">
19 changes: 10 additions & 9 deletions docs/usage.md
@@ -19,15 +19,17 @@ You will need to create a samplesheet with information about the samples you wou
The `sample` identifiers have to be the same when the vcfs originate from the same bam but were produced by different callers. The pipeline merges all vcfs from the same sample into one vcf file; if a sample has only one vcf file, merging is skipped.

```csv title="samplesheet.csv"
sample,gvcf,vcf_path,vcf_index_path
SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-2,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
sample,label,gvcf,vcf_path,vcf_index_path
SAMPLE-1,pipelineA-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-1,pipelineB-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
```

| Column | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `sample` | Custom sample name. This entry will be identical for vcfs that originate from the same bam but were yielded with different callers. Spaces in sample names are automatically converted to underscores (`_`). |
| `label` | Label for the vcf file. This is used to concatenate vcfs with the same label. |
| `gvcf` | Boolean whether the supplied sample is a gvcf (true) or a normal vcf (false). |
| `vcf_path` | Full path to VCF file, should have the extension ".g.vcf.gz", ".vcf.gz", ".g.vcf" or ".vcf". |
| `vcf_index_path` | Full path to index of (g)VCF file. Optional. Should have extension ".tbi". |
@@ -39,7 +41,7 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p
The typical command for running the pipeline is as follows:

```bash
nextflow run qbic-pipelines/vcftomat --input ./samplesheet.csv --outdir ./results --genome GATK.GRCh38 -profile docker
nextflow run qbic-pipelines/vcftomat --input ./samplesheet.csv --outdir ./results --genome GATK.GRCh38 --rename true -profile docker
```

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
@@ -69,10 +71,9 @@ nextflow run qbic-pipelines/vcftomat -profile docker -params-file params.yaml
with:

```yaml title="params.yaml"
input: './samplesheet.csv'
outdir: './results/'
genome: 'GATK.GRCh38'
<...>
input: "./samplesheet.csv"
outdir: "./results/"
genome: "GATK.GRCh38"
```

You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch).
10 changes: 10 additions & 0 deletions modules.json
@@ -5,11 +5,21 @@
"https://github.com/nf-core/modules.git": {
"modules": {
"nf-core": {
"bcftools/concat": {
"branch": "master",
"git_sha": "d1e0ec7670fa77905a378627232566ce54c3c26d",
"installed_by": ["modules"]
},
"bcftools/merge": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"bcftools/reheader": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"gatk4/genotypegvcfs": {
"branch": "master",
"git_sha": "1999eff2c530b2b185a25cc42117a1686f09b685",
5 changes: 5 additions & 0 deletions modules/nf-core/bcftools/concat/environment.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

59 changes: 59 additions & 0 deletions modules/nf-core/bcftools/concat/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
