This document describes the different outputs produced by the MGnify v6 amplicon-pipeline, which analyses raw amplicon sequencing reads to generate taxonomic assignments using closed-reference databases.
The pipeline functions on a per-run/sample basis, and the outputs are shaped in a similar way. While most of the outputs are on a per-run basis, the pipeline does generate some summaries into different top-level summary files which are aggregated from all of the input runs. Assuming all the runs are from the same study, these summary files can be considered study-level summaries. In this documentation, the different contents of both per-run and per-study outputs will be described.
There are six general categories of results, which are separated into six different output directories by the pipeline, and each successful run/sample should have all six of these directories:
├── qc
├── sequence-categorisation
├── amplified-region-inference
├── primer-identification
├── asv
└── taxonomy-summary
As these results are per-run, many of the outputs use a run ID as a prefix. For the purposes of this documentation, we are using the run ID ERR4334351
, which is a paired-end sequencing run. There are some slight differences in outputs when a run is single-end, which will be emphasised in italics, and use the run ID ERR1718805
instead.
The qc
directory contains output files related to the quality control steps of the pipeline, from the sanity checking performed by SeqFu
, to the quality filtering done by fastp
. The structure of the qc
directory contains five possible output files:
├── qc
├── ERR4334351.fastp.json
├── ERR4334351.merged.fastq.gz
├── ERR4334351_multiqc_report.html
├── ERR4334351_seqfu.tsv
└── ERR4334351_suffix_header_err.json
├── sequence-categorisation
├── amplified-region-inference
├── primer-identification
├── asv
└── taxonomy-summary
- ERR4334351.fastp.json: This
json
file contains summary output of thefastp
run, including how many reads and/or bases were filtered for various reasons. Thisjson
file is also used later byMultiQC
to generate its report files. - ERR4334351.merged.fastq.gz: This compressed
fastq
file contains the cleaned and merged reads from thefastp
run. Note: if the run is single-end, this file won't exist. - ERR1718805.fastp.fastq.gz: This compressed
fastq
file contains the cleaned reads from thefastp
run. This file can be seen as the equivalent of the previous merged file for single-end runs. - ERR4334351_multiqc_report.html: This
html
file contains theMultiQC
report for that run. It will combine outputs from three different tools into the report;fastp
,cutadapt
, andDADA2
. - ERR4334351_seqfu.tsv: This
tsv
file contains the output from the sanity checking performed bySeqFu
.SeqFu
makes a few checks of whether the given fastq files are correctly structured, and if this QC step fails, the contents of this file will indicate the reasons why. - ERR4334351_suffix_header_err.json: This
json
file contains the output from the sanity checking performed on the suffixes and headers of fastq files. It is expected that the fastq files ending with the suffix_1
should contain the/1
tag in the headers inside the file, and vice versa for the suffix_2
and tag/2
. Note: if the run is single-end, the header check should be to have no suffix, but still contain the/1
tag as is standard.
The sequence-categorisation
directory contains output files related to the extraction of rRNA reads using infernal/cmsearch
and the specific rRNA clans from the Rfam
reference database. The structure of the sequence-categorisation
directory contains three different categories of files:
├── qc
├── sequence-categorisation
├── ERR4334351.tblout.deoverlapped
├── ERR4334351_SSU.fasta
├── ERR4334351_SSU_rRNA_bacteria.RF00177.fa
└── ERR4334351_SSU_rRNA_archaea.RF01959.fa
├── amplified-region-inference
├── primer-identification
├── asv
└── taxonomy-summary
The output files of this directory are dynamic depending on the output, as the names and number of files will be different depending on what the reads matched to. To be specific, it depends on the matched rRNA amplicon type, which is usually one of:
- SSU
- LSU
- ITS
And it also depends on which Rfam clan the reads matched to, which can be one (or more) of:
- Bacteria, with Rfam ID RF00177
- Archaea, with Rfam ID RF01959
- Eukarya, with Rfam ID RF01960
- ERR4334351.tblout.deoverlapped: This
tblout
file contains the deoverlapped output from runninginfernal/cmsearch
, which describes the matching reads, including matching coordinates, confidence scores, and matching Rfam clan ID. - ERR4334351_SSU.fasta: This
fasta
file contains all of the matching sequences to a particular rRNA amplicon type, being in this case the SSU. As described previously, this file name would be different if a different amplicon was matched. - ERR4334351_SSU_rRNA_bacteria.RF00177.fa This
fasta
file contains the matching sequences of both a particular amplicon and a particular Rfam clan, in being in this case the bacterial SSU. This run has a similar file for archaeal SSU, ERR4334351_SSU_rRNA_archaea.RF01959.fa, which is a common combination. While this is the third and final output type in thesequence-categorisation
directory, as shown here you can have more than one file of this kind.
The amplified-region-inference
directory contains output files related to the inference of the amplified region. The structure of the amplified-region-inference
directory contains two different categories of files:
├── qc
├── sequence-categorisation
├── amplified-region-inference
├── ERR4334351.16S.V3-V4.txt
└── ERR4334351.tsv
├── primer-identification
├── asv
└── taxonomy-summary
The output files of this directory are also dynamic for similar reasons as sequence-categorisation
- it depends on which, and how many, amplified regions were found. The pipeline allows for at most two amplified regions, which are made up of two parts:
- The gene: either 16S or 18S.
- The hypervariable region: any region from V1 to V9, and any logical pair of regions e.g. V3-V4.
- ERR4334351.16S.V3-V4.txt: This
txt
file contains the headers of reads that were found to match a particular amplified region, being in this case the V3-V4 region of the 16S gene. As described previously, the pipeline allows for at most two amplified regions, and therefore up to two files of this type, with the naming being dynamic. - ERR4334351.tsv: This
tsv
file contains a summary of the output of the amplified region inference module. From zero to two regions, this file summarises the findings.
The primer-identification
directory contains output files related to the automatic identification and trimming of primer sequences from reads, which is a crucial QC step for ASV calling. The structure of the primer-identification
directory contains three different files:
├── qc
├── sequence-categorisation
├── amplified-region-inference
├── primer-identification
├── ERR4334351_primers.fasta
├── ERR4334351.cutadapt.json
└── ERR4334351_primer_validation.tsv
├── asv
└── taxonomy-summary
- ERR4334351_primers.fasta: This
fasta
file contains the sequences of any identified primers that were then trimmed off usingcutadapt
. - ERR4334351.cutadapt.json: This
json
file contains the summary output of thecutadapt
run, including which primers were trimmed off, how many bases were trimmed off in the process, etc. Thisjson
file is also used later byMultiQC
to generate its report files. - ERR4334351_primer_validation.tsv: This
tsv
file contains the summary output of the primer validation module. Any primers that were trimmed off will have successfully been validated as primers usinginfernal/cmsearch
, and this output file summarises these findings.
The asv
directory contains output files related to the calling of ASVs and their abundances, which is mainly performed by DADA2
. The structure of the asv
directory contains four different files and at least one subdirectory containing one extra file:
├── qc
├── sequence-categorisation
├── amplified-region-inference
├── primer-identification
├── asv
│ ├── ERR4334351_asv_seqs.fasta
│ ├── ERR4334351_DADA2-SILVA_asv_tax.tsv
│ ├── ERR4334351_DADA2-PR2_asv_tax.tsv
│ ├── ERR4334351_dada2_stats.tsv
│ └── 16S-V3-V4
│ └── ERR4334351_16S-V3-V4_asv_read_counts.tsv
└── taxonomy-summary
The subdirectories are dynamic based on the inferred amplified region. As the pipeline allows for up to two different amplified regions, there are two potential scenarios and subdirectory output structures:
- One amplified region, which will give just one subdirectory as in the example above
- Two amplified regions, which will contain three subdirectories - one for each region, and a
concat
subdirectory that will contain the concatenation of both amplified regions
- ERR4334351_asv_seqs.fasta: This
fasta
file contains the sequences of the ASVs found byDADA2
. - ERR4334351_DADA2-SILVA_asv_tax.tsv: This
tsv
file contains the assigned taxonomy of every ASV, performed byMAPseq
, using the SILVA reference database. - ERR4334351_DADA2-PR2_asv_tax.tsv: This
tsv
file contains the assigned taxonomy of every ASV, performed byMAPseq
, using the PR2 reference database. - ERR4334351_dada2_stats.tsv: This
tsv
file contains some ASV-specific QC stats, such as the proportion of reads that were removed after filtering out chimeric ASVs. Thistsv
file is also used later byMultiQC
to generate its report files. - 16S-V3-V4/ERR4334351_16S-V3-V4_asv_read_counts.tsv: This
tsv
file contains the read counts for each ASV, and is specific to the inferred amplified region. As described previously, the naming of this output is dynamic based on the amount and identity of the inferred amplified region(s).
The taxonomy-summary
directory contains output files summarising the various taxonomic assignment results using tools like MAPseq
and Krona
. The pipeline uses four different reference databases to assign taxonomy, and the outputs in taxonomy-summary
are split based on the different reference databases:
- SILVA
- PR2
- UNITE
- ITSoneDB
However, there is added complexity in the output structure for two reasons:
- SILVA is actually split into two output directories: SILVA-SSU and SILVA-LSU.
- There are two kinds of taxonomic results for both SILVA-SSU and PR2: one for the hit matches and one for ASVs, with the latter being named as DADA2-SILVA and DADA2-PR2.
├── qc
├── sequence-categorisation
├── amplified-region-inference
├── primer-identification
├── asv
└── taxonomy-summary
├── SILVA-SSU
│ ├── ERR4334351.html
│ ├── ERR4334351_SILVA-SSU.mseq
│ ├── ERR4334351_SILVA-SSU.tsv
│ └── ERR4334351_SILVA-SSU.txt
├── PR2
│ ├── ERR4334351.html
│ ├── ERR4334351_PR2.mseq
│ ├── ERR4334351_PR2.tsv
│ └── ERR4334351_PR2.txt
├── UNITE
│ ├── ERR4334351.html
│ ├── ERR4334351_UNITE.mseq
│ ├── ERR4334351_UNITE.tsv
│ └── ERR4334351_UNITE.txt
└── ITSoneDB
├── ERR4334351.html
├── ERR4334351_ITSoneDB.mseq
├── ERR4334351_ITSoneDB.tsv
└── ERR4334351_ITSoneDB.txt
All of the different possible subdirectories have the same four files. Taking PR2 as an example:
- ERR4334351_PR2.mseq: This
mseq
file contains the raw MAPseq output for everyinfernal/cmsearch
match, i.e. each match's taxonomic assignment. - ERR4334351_PR2.txt: This
txt
file contains the Krona text input that is used to generate the Krona HTML file. It contains the distribution of the different taxonomic assignments. - ERR4334351.html: This
html
file contains the Krona HTML file that interactively displays the distribution of the different taxonomic assignments. - ERR4334351_UNITE.tsv: This
tsv
file contains the read count of every taxonomic assignment similar to the Krona txt file, but in a different easier-to-parse format.
├── qc
├── sequence-categorisation
├── amplified-region-inference
├── primer-identification
├── asv
└── taxonomy-summary
├── DADA2-SILVA
│ ├── ERR4334351_16S-V3-V4_DADA2-SILVA_asv_krona_counts.txt
│ ├── ERR4334351_16S-V3-V4.html
│ └── ERR4334351_DADA2-SILVA.mseq
└── DADA2-PR2
├── ERR4334351_16S-V3-V4_DADA2-PR2_asv_krona_counts.txt
├── ERR4334351_16S-V3-V4.html
└── ERR4334351_DADA2-PR2.mseq
The two different subdirectories have the same three categories of files, two of which are dynamic in naming. Using DADA2-PR2 as an example:
- ERR4334351_DADA2-PR2.mseq: This contains the raw MAPseq output for every ASV, i.e. each ASV's taxonomic assignment. This file is not dynamic.
- ERR4334351_16S-V3-V4_DADA2-PR2_asv_krona_counts.txt: This file contains the Krona text input that is used to generate the Krona HTML file for ASV results. This file is dynamic, as it will generate it on a per amplified region-basis. This means that in cases where there are two amplified regions, you will have three of these files - one for each reference database, and one for the concatenation of the two.
- ERR4334351_16S-V3-V4.html: This file contains the Krona HTMl file that interactively displays the distribution of the different taxonomic assignments for ASV results. This file is dynamic in the exact same way as the Krona text input file, i.e. based on the inferred amplified region(s).
The pipeline generated four different per-study output files that aggregate and summarise data, from failed runs to primer validation metadata.
The pipeline generates two MultiQC reports: one per-study (study_multiqc_report.html
), and one per-run (qc/${id}_multiqc_report.html
). These reports aggregate a few QC statistics from some of the tools run by the pipeline, including:
- fastp
- cutadapt
- DADA2 (as a custom report)
The pipeline runs a couple of sanity and QC checks on every input run. In the case where a run fails, it will be added to a top-level report (qc_failed_runs.csv
) that aggregates the IDs of any other run that failed, along with the particular reason it failed. For example:
ERR6093685,no_reads
ERRSFXHDFAIL,sfxhd_fail
ERRSEQFUFAIL,seqfu_fail
SRRLIBSTRATFAIL,libstrat_fail
The different exclusion messages are:
Exclusion message | Description |
---|---|
seqfu_fail |
Run had an error after running seqfu check . Check the log file in qc/${id}_seqfu.tsv for the exact reason |
sfxhd_fail |
Run had an error related to the suffix of the file _1/_2 not matching the headers inside the fastq file. Check the log file in qc/${id}_suffix_header_err.json for the reads at fault |
libstrat_fail |
Run was predicted to likely not be of AMPLICON sequencing based on base-conservation patterns at the beginning of reads |
no_reads |
Run had no reads left after running fastp |
Similarly to runs that fail QC, runs that pass QC are guaranteed to generate results. The IDs of such runs is aggregated into a top-level file (qc_passed_runs.csv
). For example:
SRR17062740,all_results
ERR4334351,all_results
ERRNOASVS,no_asvs
An important thing to note is that while a run might succeed at generating results for the closed-reference based method, it might fail at some extra QC checks required for generating results using the ASV method. For this reason, there are two statuses a passed run can have:
all_results
- if results for both methods could be generatedno_asvs
- if ASV results could not be generated
The pipeline performs inferrence of primer presence and sequence. For any runs where a primer was detected, metadata about it will be aggregated into a top-level primer validation summary file (primer_validation_summary.json
), including its sequence, region, and identification strategy. For example:
[
{
"id": "SRR17062740",
"primers": [
{
"name": "F_auto",
"region": "V4",
"strand": "fwd",
"sequence": "ATTCCAGCTCCAATAG",
"identification_strategy": "auto"
},
{
"name": "R_auto",
"region": "V4",
"strand": "rev",
"sequence": "GACTACGATGGTATNTAATC",
"identification_strategy": "auto"
}
]
},
{
"id": "ERR4334351",
"primers": [
{
"name": "341F",
"region": "V3",
"strand": "fwd",
"sequence": "CCTACGGGNGGCWGCAG",
"identification_strategy": "std"
},
{
"name": "805R",
"region": "V4",
"strand": "rev",
"sequence": "GACTACHVGGGTATCTAATCC",
"identification_strategy": "std"
}
]
}
]
The value of the identification_strategy
key can either be:
std
- Meaning the primer was matched to one of the standard library primers (more reliable)auto
- Meaning the primer was automatically predicted (less reliable)