- Abstract
- Motivation
- Description of Framework
- Installation
- Data Download
- Execution
- Documentation
- Contribute
- Citation
- Synth Data Generation: Synthetic genomics data are generated based on the TP53 gene using the NEATv3.3 simulator, in order to create synthetic datasets that mimic real cancer genome data. The "Ground Truth" is established by creating 10 individual datasets (all with the same characteristics) containing Single Nucleotide Polymorphisms (SNPs) and Insertions/Deletions (INDELs). The genomic regions where variants occur with 100% Allele Frequency are chosen, in order to avoid variants that are products of errors and noise. All these datasets are then merged into a single file, and the allele frequency is measured again at these genomic regions of interest.
- Benchmarking Variant Callers: The somatic variant callers GATK-Mutect2, Freebayes, VarDict, VarScan2 and LoFreq are evaluated on this synthetic Ground Truth dataset. Their performance at low allele frequencies (≤10%) is explored in particular, as such variants are especially challenging to detect accurately.
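To make the ground-truth construction in the data generation step more concrete, here is a minimal Python sketch. It assumes (this is not stated above) that the 10 individual datasets contribute equal read depth at every position, so that a variant present at 100% allele frequency in k of the datasets is expected at roughly k/10 in the merged file. The function name, the variant tuples and the positions are hypothetical and only serve as an illustration; this is not the framework's own code.

```python
from collections import Counter
from fractions import Fraction


def expected_merged_af(datasets):
    """Return the expected allele frequency of each variant after merging.

    `datasets` is a list of sets; each set holds the variants
    (chrom, pos, ref, alt) observed at 100% allele frequency in one dataset.
    Assumption: every dataset contributes the same read depth at every site.
    """
    n = len(datasets)
    counts = Counter(variant for ds in datasets for variant in ds)
    return {variant: Fraction(k, n) for variant, k in counts.items()}


# Hypothetical example: 10 datasets, positions are made up for illustration.
snp = ("chr17", 1000, "C", "T")
indel = ("chr17", 2000, "G", "GA")
datasets = [{snp} for _ in range(10)]
datasets[0].add(indel)  # the indel is present in only one dataset

merged = expected_merged_af(datasets)

# Variants in the low-frequency range (<= 10%) that is hard to call.
low_frequency = {v for v, af in merged.items() if af <= Fraction(1, 10)}
```

In real data the measured allele frequency will deviate from this idealised value, because coverage differs between datasets and positions.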
The overall aim is to provide a robust framework for evaluating the performance of tumor-only somatic variant calling algorithms using synthetic datasets. With a reliable ground truth, we can thoroughly test and improve the accuracy of variant calling algorithms for cancer genomics applications. This framework represents an essential step towards more precise and effective identification of genetic lesions associated with cancer and other diseases.
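One common way to quantify a caller's performance against a ground truth is to count true positives, false positives and false negatives and derive precision and recall from them. The sketch below illustrates that idea in Python. It assumes variants can be compared by exact match on (chrom, pos, ref, alt), which ignores issues such as INDEL normalisation, and it is not necessarily how the R scripts of this framework perform the comparison. The function name and the example variants are hypothetical.

```python
def score_caller(truth, calls):
    """Compare a caller's variant set against the ground truth.

    Both arguments are sets of (chrom, pos, ref, alt) tuples.
    Returns the TP/FP/FN counts together with precision and recall.
    """
    tp = len(truth & calls)   # variants in the truth that were called
    fp = len(calls - truth)   # called variants absent from the truth
    fn = len(truth - calls)   # truth variants the caller missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Hypothetical example with made-up positions.
truth = {("chr17", 1000, "C", "T"), ("chr17", 2000, "G", "GA")}
calls = {("chr17", 1000, "C", "T"), ("chr17", 3000, "A", "G")}
result = score_caller(truth, calls)
```

To focus on the low-frequency range, both sets can be restricted to variants with allele frequency ≤10% before scoring.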
All data are open and available on Zenodo. For specific instructions please see our UserGuide.
- To create the conda environment that was used for the analysis run `conda env create -f environment.yml` and to activate it run `conda activate synth4bench`.
- To install NEATv3.3, download version v3.3. To call the main script run the command `python gen_reads.py --help`. For any further info please see the README.md file included in the downloaded files of version 3.3.
- To install bam-readcount follow their instructions and then run `build/bin/bam-readcount --help` to check that it has been installed properly. If you face any problems with the installation during the `make` command, please add the executable that can be found here to the `bam-readcount\build\bin` folder.
- To install the R package dependencies run this command: `install.packages(c("stringr", "data.table", "vcfR", "ggplot2", "ggvenn", "ggforce", "ggsci", "patchwork"))`.
- The extra script `vscan_pileup2cns2vcf.py` for VarScan can be found here.
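After working through the steps above, it can be handy to confirm that the command-line tools are reachable before starting an analysis. The following is a minimal sketch using only the Python standard library. The helper name `find_missing_tools` is hypothetical, and the list of tools is an assumption based on the commands mentioned above (bam-readcount is left out because it is called through its build path rather than from the PATH).

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Executables named in the installation steps; adjust to match your setup.
required = ["conda", "Rscript", "python"]
missing = find_missing_tools(required)

if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed tools were found on PATH.")
```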
For detailed instructions regarding the execution please read our UserGuide.
To run the bash scripts, fill in the parameters in the `parameters.yaml` file and then run:
bash synth_generation_template.sh > desired_name.sh
and
bash variant_calling_template.sh > desired_name.sh
Use a different output name for each command, since `>` overwrites an existing file.
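The redirection in the commands above suggests that each template prints a script to standard output, which is then saved under a name of your choice. For users who prefer to drive this step from Python, a rough equivalent is sketched below. The helper name `render_to_file` and the output file names in the comments are hypothetical; only the template names come from the commands above.

```python
import subprocess
import sys
from pathlib import Path


def render_to_file(command, output_path):
    """Run `command` and save its standard output to `output_path`.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    output = Path(output_path)
    output.write_text(result.stdout)
    return output


# Intended use (output names are hypothetical, pick your own):
#   render_to_file(["bash", "synth_generation_template.sh"], "synth_generation.sh")
#   render_to_file(["bash", "variant_calling_template.sh"], "variant_calling.sh")
```

Using `check=True` means a failing template stops the workflow instead of silently leaving an empty output file.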
Run the following commands to check the parameters of each R script:
Rscript R/S4BR.R --help
Rscript R/S4BR_plot.R --help
For more info regarding the documentation please visit here.
We welcome and greatly appreciate any sort of feedback and/or contribution!
If you have any questions, please either open an issue here or write to us at sfragkoul@certh.gr or inab.bioinformatics@lists.certh.gr.
Our work has been submitted to the bioRxiv preprint repository. If you use our work please cite as follows:
S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. E. Psomopoulos, “Synth4bench: a framework for generating synthetic genomics data for the evaluation of tumor-only somatic variant calling algorithms.” 2024, doi:10.1101/2024.03.07.582313.
- S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. Psomopoulos, “synth4bench: Benchmarking Somatic Variant Callers A Tale Unfolding In The Synthetic Genomics Feature Space”, 23rd European Conference On Computational Biology (ECCB24), Sep 2024, Turku, Finland, doi:10.5281/zenodo.14186509
- S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. Psomopoulos, “Exploring Somatic Variant Callers' Behavior: A Synthetic Genomics Feature Space Approach”, ELIXIR AHM24, Jun 2024, Uppsala, Sweden, doi:10.7490/f1000research.1119793.1
- S.-C. Fragkouli, N. Pechlivanis, A. Orfanou, A. Anastasiadou, A. Agathangelidis, and F. Psomopoulos, “Synth4bench: a framework for generating synthetic genomics data for the evaluation of somatic variant calling algorithms”, 17th Conference of Hellenic Society for Computational Biology and Bioinformatics (HSCBB), Oct 2023, Thessaloniki, Greece, doi:10.5281/zenodo.8432060
- S.-C. Fragkouli, N. Pechlivanis, A. Agathangelidis, and F. Psomopoulos, “Synthetic Genomics Data Generation and Evaluation for the Use Case of Benchmarking Somatic Variant Calling Algorithms”, 31st Conference in Intelligent Systems For Molecular Biology and the 22nd European Conference On Computational Biology (ISMB-ECCB23), Jul 2023, Lyon, France, doi:10.7490/f1000research.1119575.1