
benchmark

DrYak edited this page Nov 6, 2022 · 3 revisions

Warning: the documentation on this wiki is deprecated and refers to the benchmarking present in V-pipe 2.0 (published in Bioinformatics). A new benchmarking workflow is being integrated into V-pipe ahead of version 3.0. You can find a preview of this upcoming feature in the readme of the benchmark auxiliary workflow.

V-pipe 2.0 as a benchmark tool

V-pipe also provides a unified benchmarking platform by incorporating two additional modules: a read simulator (simBench) and a module that evaluates the accuracy of the results (testBench).

simBench

We implemented five operating modes:

  1. Default. Haplotype sequences are sampled from the leaves of a perfect binary tree. The root of the tree is used as the reference for the read alignment and is output as a FASTA file in references/haplotype_master.fasta. Inner-node and leaf sequences are generated from their predecessor with mutation, insertion and deletion rates configured by the user in the samples.tsv file or equivalent.
  2. We generate a random sequence, which we call the "master" haplotype sequence. By default, this sequence is output as a FASTA file in references/haplotype_master.fasta. Simulated haplotypes are generated from the master sequence with user-configurable mutation, insertion and deletion rates. Reads are then generated from the set of underlying haplotype sequences. The master haplotype is used as the reference for the read alignment.
  3. The user provides the master haplotype sequence, and its location is specified as

     [input]
     reference = path/to/reference.fasta

     Haplotypes and reads are generated as above.
  4. The user provides the set of underlying haplotypes, for instance, corresponding to known isolates. The location of the FASTA file containing the haplotype sequences should be specified in the samples.tsv file or equivalent.
  5. The user provides the FASTQ files containing the sequencing reads. This can be the case for control samples, i.e. mock mixtures in which known isolates are mixed in the laboratory and then sequenced. In that case, the user needs to provide a file containing the sequences of the haplotypes and store it in <datadir>/<sample-ID>/<sample-data>/references/haplotypes/haplotypes.fasta. In addition, the user should indicate that haplotypes and reads do not need to be simulated:

[general]
simulate = False
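
To make the default mode above more concrete, here is a minimal Python sketch of sampling haplotypes from the leaves of a perfect binary tree. It is not the simBench implementation: the function names (`mutate`, `sample_leaves`), the per-site treatment of the rates, and the parameters `depth` and `seed` are assumptions chosen for illustration.

```python
import random

ALPHABET = "ACGT"


def mutate(seq, sub_rate, ins_rate, del_rate, rng):
    """Return a copy of seq with per-site substitutions, insertions and deletions."""
    out = []
    for base in seq:
        r = rng.random()
        if r < del_rate:
            continue  # deletion: drop this base
        if r < del_rate + sub_rate:
            # substitution: replace with a different base
            base = rng.choice([b for b in ALPHABET if b != base])
        out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice(ALPHABET))  # insertion after this site
    return "".join(out)


def sample_leaves(root, depth, sub_rate, ins_rate, del_rate, seed=0):
    """Evolve root down a perfect binary tree and return its 2**depth leaves."""
    rng = random.Random(seed)
    level = [root]
    for _ in range(depth):
        # every node of the current level gets exactly two children
        level = [
            mutate(parent, sub_rate, ins_rate, del_rate, rng)
            for parent in level
            for _child in range(2)
        ]
    return level


# Hypothetical usage: a random 200 bp root (the alignment reference)
# and a tree of depth 3, which yields 8 haplotypes.
rng = random.Random(42)
master = "".join(rng.choice(ALPHABET) for _ in range(200))
haplotypes = sample_leaves(master, depth=3, sub_rate=0.01, ins_rate=0.0, del_rate=0.0)
print(len(haplotypes))
```

Because each inner node is derived from its predecessor, haplotypes that share a recent ancestor in the tree also share most of their mutations, which is what distinguishes this mode from drawing every haplotype independently from the master sequence.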

testBench

We report true positives, false positives, false negatives, and true negatives per sample analysed. The output is located in the variants subfolder as SNV_calling_performance.tsv. In addition, we report for each individual sample the frequencies of true positives, false positives, and false negatives, as well as the number of false negatives per haplotype. The output files are located in <datadir>/<sample-ID>/<sample-data>/variants/SNVs/.
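
The four counts reported per sample can be illustrated with a short Python sketch. This is not the testBench code: the function name `snv_confusion`, the representation of an SNV as a `(position, base)` pair, and the way true negatives are derived from a total number of candidate loci are all assumptions made for the example.

```python
def snv_confusion(true_snvs, called_snvs, candidate_loci):
    """Count TP, FP, FN and TN, treating each (position, base) pair as one SNV.

    candidate_loci is the (assumed) total number of (position, base)
    pairs that could have been called; it is only needed for TN.
    """
    true_snvs, called_snvs = set(true_snvs), set(called_snvs)
    tp = len(true_snvs & called_snvs)   # called and really present
    fp = len(called_snvs - true_snvs)   # called but not present
    fn = len(true_snvs - called_snvs)   # present but missed
    tn = candidate_loci - tp - fp - fn  # neither present nor called
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical example: three true SNVs, three calls, two of them correct.
truth = {(10, "A"), (25, "T"), (40, "G")}
calls = {(10, "A"), (40, "G"), (77, "C")}
counts = snv_confusion(truth, calls, candidate_loci=300)
print(counts)
```

Keeping the true and called variants as sets makes each count a single set operation and avoids double-counting a variant that appears in several haplotypes.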

vpipeBench

vpipeBench runs V-pipe together with the simulation and testing modules:

path/to/V-pipeDir/init_project.sh -b
./vpipeBench

vpipeBenchRunner

vpipeBenchRunner benchmarks multiple pipeline configurations simultaneously. As input, it expects a vpipe.config file and a samples.tsv file. The reference sequence may be provided, in which case its location is specified via the configuration file. Additional FASTA files containing, e.g., haplotype sequences used to simulate the in silico virus population should be located in a subfolder called references. We strongly advise creating the conda environments prior to executing vpipeBenchRunner.
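
As a rough illustration of the configuration input, the sketch below reads the two settings shown earlier on this page (`reference` in `[input]` and `simulate` in `[general]`) with Python's standard `configparser` module. It only assumes that vpipe.config follows the INI layout shown above; the fallback values are assumptions for the example and not documented V-pipe defaults.

```python
import configparser

# Inline stand-in for a vpipe.config file, using only the
# sections and keys that appear on this page.
EXAMPLE_CONFIG = """
[input]
reference = path/to/reference.fasta

[general]
simulate = False
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_CONFIG)

# Fallbacks are illustrative assumptions, not V-pipe's documented defaults.
reference = parser.get("input", "reference", fallback=None)
simulate = parser.getboolean("general", "simulate", fallback=True)

print(reference, simulate)
```

To read a real file instead of the inline string, `parser.read("vpipe.config")` can be used in place of `read_string`.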

For cluster execution, we currently support two ways of passing cluster parameters: (i) using the environment variable SNAKEMAKE_OPTS or (ii) using the configuration file.
